The Conversion Software Registry

National Center for Supercomputing Applications University of Illinois at Urbana-Champaign The Conversion Software Registry Michal Ondrejcek, Kenton McHenry, Rob Kooper, Luigi Marini, and Peter Bajcsy


Page 1: The Conversion Software Registry

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

The Conversion Software Registry

Michal Ondrejcek, Kenton McHenry, Rob Kooper, Luigi Marini, and Peter Bajcsy

Page 2: The Conversion Software Registry

With an increasing number of file formats in use each year, preservation of electronic records has become one of the major challenges for the National Archives and Records Administration (NARA).

The Strategic Plan of the National Archives and Records Administration (NARA) 2006-2016, "Preserving the Past to Protect the Future", 2006. URL: http://www.archives.gov/about/plans-reports/strategic-plan/

Overview

• Why?
  • Will there be software to load the file in the future?
  • If not, will the specification for the format still exist?
  • What would be the best file format conversion in terms of information preservation?
  • Was the specification ever available to begin with, in the case of closed/proprietary formats?

This research is partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI-9619019.

2010 MS eScience -1

Page 3: The Conversion Software Registry

Conversions

• Convert files to an open, standardized format to store alongside the original
• How, and to which format?
  • Conversions often result in some information loss; which format would lose the least?
  • If we had a “universal” converter, we could test conversions and compare the files before and after to estimate information loss
• How do we convert!?
  • MANY file formats!
  • MANY closed/proprietary formats!
  • MANY with large/complex specifications


Page 4: The Conversion Software Registry

Available 3D File Formats…

• Many applications to create/view/save 3D content
• MANY of them introduce a new file format for that content!


Page 5: The Conversion Software Registry

• Most applications support a handful of import and export formats
• Conversions perform differently depending not only on the algorithm used but also on the purpose of the format's domain
  • emphasis on texture in 3D
  • morphology vs. color histogram in 2D


Page 6: The Conversion Software Registry

NCSA file conversion technologies

• Visualization (I/O Graph)
• Conversion (Polyglot)
• Software Reuse
• Closed Source Software
• Comparison (Versus)
• I/O Graph Weights Tool


Page 7: The Conversion Software Registry

Input/Output Graphs

• Software import and export options are visualized.
• I/O Graph chooses the shortest conversion path, i.e. the one that uses the fewest applications.
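One way to picture the "fewest applications" search is a breadth-first search over a graph whose nodes are formats and whose edges are conversions offered by an application. The sketch below is illustrative only: the edge list, the application names (`AppA` to `AppE`) and the function name `fewest_steps` are assumptions, not taken from the I/O Graph tool.

```python
from collections import deque

# Hypothetical I/O graph: each edge is (input format, output format, application).
EDGES = [
    ("obj", "3ds", "AppA"),
    ("3ds", "dae", "AppB"),
    ("obj", "stl", "AppC"),
    ("stl", "dae", "AppD"),
    ("dae", "u3d", "AppE"),
]

def fewest_steps(edges, source, target):
    """Breadth-first search: return the conversion path with the fewest steps."""
    outgoing = {}
    for src, dst, app in edges:
        outgoing.setdefault(src, []).append((dst, app))
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for dst, app in outgoing.get(fmt, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(fmt, dst, app)]))
    return None  # no conversion path exists

path = fewest_steps(EDGES, "obj", "u3d")
```

Each element of `path` names the input format, output format, and application for one step, so the length of the list is the number of application runs needed.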


Page 8: The Conversion Software Registry

Input/Output Graphs

[Figure: I/O graph labeled "Shortest conversion path". Applications shown: 3DS Max, Adobe 3D Reviewer, AutoCAD, Blender, Cinema 4D, K-3D, LightWave 3D, Maya, Wings 3D]


Page 9: The Conversion Software Registry

Software Reuse Layer

Exists for the sole purpose of providing an API to functionality in 3rd party software
• Controls software via wrapper scripts
  • AutoHotkey, AppleScript, various shell scripts
  • Vision-based scripts
• Hides away the details of using 3rd party software
• Attempts to recover from errors; can throw exceptions

Making Use of 3rd Party Software: we define this as the wrapping of 3rd party software, utilizing whatever interfaces the software vendors have made available, in order to re-introduce an API-like interface to embedded functionality.
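A minimal sketch of the wrapping idea, assuming the wrapper script is an ordinary command-line program: the caller sees one function, a retry stands in for "attempts to recover from errors", and a failure surfaces as an exception. The names `run_wrapper` and `ConversionError` are hypothetical and not part of the NCSA software.

```python
import subprocess
import sys

class ConversionError(Exception):
    """Raised when the wrapped application fails on every attempt."""

def run_wrapper(command, attempts=2, timeout=60):
    """Run a wrapper script, retrying on failure, and hide process details."""
    last_error = "no attempts made"
    for _ in range(attempts):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            last_error = str(exc)
            continue
        if result.returncode == 0:
            return result.stdout
        last_error = result.stderr.strip() or f"exit code {result.returncode}"
    raise ConversionError(
        f"{command[0]} failed after {attempts} attempts: {last_error}"
    )

# Stand-in for a real wrapper script: a Python one-liner that prints a message.
output = run_wrapper([sys.executable, "-c", "print('converted')"])
```

A real wrapper would invoke an AutoHotkey, AppleScript, or shell script instead of the Python one-liner used here as a stand-in.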



Page 10: The Conversion Software Registry

Software Reuse Layer

• Exists as a service on the machine where the 3rd party software is installed
• Clients provide the Java API
• Many servers can exist on many machines running different platforms


Page 11: The Conversion Software Registry

Polyglot

The sole purpose of this layer is conversions.
• Uses multiple software reuse servers
• Merges available script operations into an I/O-Graph
• Searches the I/O-Graph for conversion paths between an input format and a desired output format
• Has no knowledge of the underlying 3rd party software
• Can use redundancy in software reuse servers to improve performance and work around faults
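The merge step can be sketched as follows, assuming each software reuse server reports its operations as (application, input format, output format) triples. The server names, application names, and the `merge_operations` function are hypothetical; the point is that an edge offered by several servers keeps all of them, which is what allows load sharing and failover.

```python
# Hypothetical reports from two software reuse servers:
# server name -> list of (application, input format, output format).
SERVER_OPERATIONS = {
    "server-1": [("AppA", "doc", "pdf"), ("AppB", "pdf", "txt")],
    "server-2": [("AppA", "doc", "pdf"), ("AppC", "doc", "odt")],
}

def merge_operations(server_operations):
    """Merge per-server operations into one graph keyed by (input, output)."""
    graph = {}
    for server, operations in sorted(server_operations.items()):
        for app, src, dst in operations:
            # Record every server that can perform this conversion with this app.
            providers = graph.setdefault((src, dst), {}).setdefault(app, [])
            providers.append(server)
    return graph

graph = merge_operations(SERVER_OPERATIONS)
```

Here `graph[("doc", "pdf")]` lists both servers under `AppA`, so a client could send the job to either one.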


Page 12: The Conversion Software Registry

Comparison Layer

The sole purpose of this layer is to compare files.
• Versus, a framework for pair-wise digital object comparisons. The library extracts the same features from both objects and computes their similarity based on the chosen measure.
• Uses the Polyglot layer to convert many test files across many of the possible paths: A -> B -> A’
• Compares files before and after conversion
• I/O Graph Weights Tool: converts a set of files across many paths using Polyglot and scripts, then adds the information losses obtained from Versus as edge weights to the I/O Graph.
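The round trip A -> B -> A’ and the resulting edge weight can be sketched with toy stand-ins. Everything below is an assumption for illustration: "files" are lists of numbers, the conversion rounds values, the similarity is the fraction of matching entries, and the whole round-trip loss is attributed to the A -> B edge, which is a simplification. Versus and Polyglot are not called.

```python
def round_trip_loss(original, convert, convert_back, similarity):
    """Convert A -> B -> A' and return the loss 1 - similarity(A, A')."""
    restored = convert_back(convert(original))
    return 1.0 - similarity(original, restored)

def edge_weight(samples, convert, convert_back, similarity):
    """Average round-trip loss over a set of test files."""
    losses = [
        round_trip_loss(sample, convert, convert_back, similarity)
        for sample in samples
    ]
    return sum(losses) / len(losses)

# Toy stand-ins for a lossy conversion and its inverse.
def to_b(values):
    return [round(v) for v in values]

def to_a(values):
    return [float(v) for v in values]

def similarity(a, b):
    """Fraction of positions where the two files agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

samples = [[1.0, 2.5, 3.0], [4.0, 5.0]]
weights = {("A", "B"): edge_weight(samples, to_b, to_a, similarity)}
```

In this toy case only the value 2.5 is altered by the round trip, so the first file loses one third of its entries, the second loses nothing, and the edge weight is their average.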



Page 13: The Conversion Software Registry

Conversion Software Registry (CSR)


Page 14: The Conversion Software Registry

• http://isda.ncsa.illinois.edu/NARA/CSR
• Complementary to format registries such as PRONOM and GDFR
• No similar service that we are aware of.
• Community contributions encouraged

A database focused on:
• Conversion software!
• Finding subsets of software for specific conversion needs
• Finding conversion paths between pairs of formats


Page 15: The Conversion Software Registry

The CSR pseudo-tables block design

Parts: 1) Conversions, 2) Software, 3) Formats and Files, 4) Scripts, 5) User login and history


Page 16: The Conversion Software Registry

Adding Conversions


Page 17: The Conversion Software Registry

Adding Conversions - scripts

Script types present:
• Convert - full conversion
• Monitor - monitoring software behavior
• Kill - terminating the software
• Open/Save/Import/Export

Script headers are standardized, with up to four lines giving the software name and version, the software domain (image, 3D, document, etc.), and the input/output formats.
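The slides do not show the header syntax, so the sketch below assumes a purely hypothetical layout of up to four `# Key: value` comment lines. It only illustrates how a standardized header makes a script's software, domain, and formats machine-readable; the keys, the `#` prefix, and `parse_header` are all invented for this example.

```python
# Hypothetical header layout; the real CSR/Polyglot syntax may differ.
EXAMPLE_SCRIPT = """# Software: ExampleApp 1.0
# Domain: 3d
# Input: obj, stl
# Output: dae
run the conversion here
"""

def parse_header(script_text):
    """Parse up to four leading '# Key: value' lines into a dict."""
    header = {}
    for line in script_text.splitlines()[:4]:
        if not line.startswith("#") or ":" not in line:
            break  # the header ends at the first non-header line
        key, value = line[1:].split(":", 1)
        header[key.strip().lower()] = value.strip()
    # Format lists are comma-separated in this assumed layout.
    for key in ("input", "output"):
        if key in header:
            header[key] = [fmt.strip() for fmt in header[key].split(",")]
    return header

header = parse_header(EXAMPLE_SCRIPT)
```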


Page 18: The Conversion Software Registry

Editing Pane

• Software
• Vendors
• Software platforms
• Interfaces
• Formats
• Equivalent extensions
• Sample files


Page 19: The Conversion Software Registry

File format identifiers and extensions

Canonical and derived identifiers:
• Common usage: ‘TIFF’
• MIME: ‘image/tiff’
• UTI: ‘public.tiff’
• PRONOM PUID: ‘fmt/10’

CSR relies on the identifiers.

CSR supports search by extension, MIME type, and PUID. The PUID distinguishes different format versions. For example, the tiff extension corresponds to PUID ‘fmt/10’ for TIFF version 6.0 and ‘fmt/155’ for GeoTIFF.
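A small sketch of why the PUID matters for lookups: one extension can map to several format records, while a PUID picks out exactly one. The record layout, the `find` helper, and the MIME type given for GeoTIFF are assumptions; only the extension and PUID values come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatRecord:
    name: str
    extension: str
    mime: str
    puid: str

# PUIDs and the TIFF MIME type are from the slide; the GeoTIFF MIME type
# is an assumption, and the record layout itself is hypothetical.
RECORDS = [
    FormatRecord("TIFF 6.0", "tiff", "image/tiff", "fmt/10"),
    FormatRecord("GeoTIFF", "tiff", "image/tiff", "fmt/155"),
]

def find(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record for record in records
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]

by_extension = find(RECORDS, extension="tiff")   # ambiguous: two records
by_puid = find(RECORDS, puid="fmt/155")          # unambiguous: one record
```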

Page 20: The Conversion Software Registry

Test files

Any file that can be used to assess conversion accuracy and to validate software. Uploaded files are verified with the UNIX file command and checked against the file extension entry in the CSR database.

Additional file validation has been performed semi-automatically by NARA using the GTRI (Georgia Tech Research Institute) File Type Identifier.
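The upload check described above could look roughly like this: ask the UNIX `file` command for a MIME type and compare it with the type registered for the file's extension. The `EXPECTED_MIME` table and the function names are hypothetical, and the actual CSR check may compare different fields; the `file --brief --mime-type` invocation is a standard use of that command.

```python
import subprocess

def identify_mime(path):
    """Ask the UNIX file command for the MIME type of a file."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Hypothetical excerpt of the registry's extension table.
EXPECTED_MIME = {
    "tiff": "image/tiff",
    "tif": "image/tiff",
    "pdf": "application/pdf",
}

def verify_upload(path, identify=identify_mime):
    """True when the detected type matches the type registered for the extension."""
    extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    expected = EXPECTED_MIME.get(extension)
    return expected is not None and identify(path) == expected
```

The `identify` parameter exists so the check can be exercised without the `file` command installed, for example `verify_upload("scan.tiff", identify=lambda p: "image/tiff")`.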

W. Underwood, “Extensions of the UNIX file command and magic file for file type identification”, Technical Report ITTL/CSITD 09-02, Georgia Tech Research Institute, 2009. URL: http://perpos.gtri.gatech.edu/publications/index.htm

Page 21: The Conversion Software Registry

Searching for Software

Find a conversion path for converting a file format A to a file format B.


Page 24: The Conversion Software Registry

Shortest path from file A to B

• Dijkstra's algorithm - finds the lowest-cost path (e.g. the shortest path) from one vertex/node to every other vertex, with edge weights defined by some measure
• Subjective measure - a user's ranking of a software package propagates to all of its conversions.
• Quantitative measures within the domain (images, 3D, etc.)
  • Images: normalized cross-correlation measure, histogram distance measure
  • 3D: surface area, statistics, spin images, light fields
  • Document (PDF)
  • Audio
• User-specified measures, for example a linear combination of measures.
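A minimal sketch of this search, assuming each conversion edge carries a few per-edge measures that are combined linearly into one non-negative cost before Dijkstra's algorithm runs. The measure names (`loss`, `rank`), the coefficients, and the three-format graph are invented for illustration and are not CSR data.

```python
import heapq

def combined_cost(measures, coefficients):
    """Linear combination of per-edge measures (costs must stay non-negative)."""
    return sum(coefficients[name] * value for name, value in measures.items())

def dijkstra(edges, source):
    """Return (cost, predecessor) maps for lowest-cost paths from source."""
    graph = {}
    for src, dst, cost in edges:
        graph.setdefault(src, []).append((dst, cost))
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            candidate = cost + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return best, previous

def path_to(previous, source, target):
    """Rebuild the path from the predecessor map, or None if unreachable."""
    path = [target]
    while path[-1] != source:
        if path[-1] not in previous:
            return None
        path.append(previous[path[-1]])
    return path[::-1]

# Hypothetical measures per conversion: information loss and a user ranking penalty.
COEFFICIENTS = {"loss": 0.75, "rank": 0.25}
RAW_EDGES = [
    ("A", "B", {"loss": 0.5, "rank": 0.0}),
    ("B", "C", {"loss": 0.0, "rank": 1.0}),
    ("A", "C", {"loss": 1.0, "rank": 0.0}),
]
EDGES = [(s, d, combined_cost(m, COEFFICIENTS)) for s, d, m in RAW_EDGES]

best, previous = dijkstra(EDGES, "A")
route = path_to(previous, "A", "C")
```

With these numbers the two-step route through B costs 0.375 + 0.25 = 0.625, which beats the direct conversion at 0.75, so a path with more applications can still be the preferred one once information loss is weighed in.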


Page 25: The Conversion Software Registry

Searching for Conversion Paths


Page 31: The Conversion Software Registry

Future Directions

• Compiling known “good” data of various formats
• Systematically measuring information loss across software and formats
  • Possibly distributing the task among a community
  • Ranking software based on performance
• Integration of CSR and Polyglot.


Page 32: The Conversion Software Registry

Summary

• Currently contains:
  • 2,006 software packages
  • 1,682 format extensions
  • 233,810 conversions
• No similar service that we are aware of
• Complementary to format registries such as PRONOM and GDFR
• Free
• Community contributions encouraged

http://isda.ncsa.illinois.edu/NARA/CSR
