DESCRIPTION

This is an introduction to the identification and characterisation of file formats and to the tools that can be used for both. The introduction was given by Carl Wilson of the Open Planets Foundation at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.

TRANSCRIPT

SCAPE

Carl Wilson Open Planets Foundation

SCAPE Training Guimarães

Characterisation - 101 An introduction to the identification and characterisation of file formats.

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

SCAPE About Us

• Carl Wilson
  • Open Planets Foundation
  • carl@openplanetsfoundation.org
  • http://www.openplanetsfoundation.org

• SCAPE Project
  • EU funded research project
  • SCAlable Preservation Environments
  • http://www.scape-project.eu

SCAPE About You

• Once Around The Room
  • Name
  • Where you work
  • What you do
  • Why you’re here

• DO Ask Questions
  • Or tell me to slow down…
  • Or ask me to repeat something…

SCAPE File Formats

• What is a File Format?
  • A “standard” method of encoding data for storage.
  • May follow an open specification
  • Or a proprietary one (open preferred)
  • Or simply follow a loosely documented convention

SCAPE Who Cares About Formats?

• Operating Systems: to open a file with an application that can interpret or render it.

• Web Servers: to negotiate Content-Type in HTTP requests

• Memory Institutions: to identify software stacks that can render or extract meaning from a file, now or at a later date.

• More Generally: everyone with digital content, whether they know it or not.

SCAPE Some Uses of Format Information

• Format Information:
  • Associates a file with software that can interpret and/or render its contents
  • Can be used to find documentation / specifications to help interpret a file’s contents
  • Is a first step to preservation planning, knowing what you have…

SCAPE File Name Extension

• A file name suffix, separated from the file base name by a dot “.”.
• Examples: .pdf, .txt, .jpg, .doc, .docx
• This has worked for a number of years BUT
  • Any user with the right permission can change a file extension
  • Bytes aren’t always transferred with a name
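To make the weakness of extensions concrete, here is a minimal Python sketch. The helper name extension_of is hypothetical and is used only for illustration; it reads the name and never looks at the file's bytes, which is exactly the limitation described on this slide.

```python
from pathlib import PurePath


def extension_of(filename: str) -> str:
    """Return the lower-cased extension (including the dot), or '' if none."""
    return PurePath(filename).suffix.lower()


print(extension_of("report.pdf"))      # .pdf
print(extension_of("REPORT.PDF"))      # .pdf
print(extension_of("archive.tar.gz"))  # .gz (only the last suffix is returned)
print(extension_of("README"))          # prints an empty line: no extension

# Renaming changes the answer even though the bytes would be unchanged.
renamed = "report.pdf".replace(".pdf", ".txt")
print(extension_of(renamed))           # .txt
```

Because the answer depends only on the name, a renamed file or a nameless byte stream defeats this approach entirely.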

SCAPE Internet Media (MIME) Types

• The format identifiers used by the web
• Examples:
  • text/plain
  • text/html
  • image/jpeg

• Don’t readily hold extra information such as format version, but may be extended.
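As a rough illustration, Python's standard mimetypes module maps file names to MIME types, and the standard email package can split a type from its parameters. The helper names guess_mime and split_version are hypothetical, and the "version" parameter is only an assumed example of how a type might be extended; it is not a claim about what any particular tool emits.

```python
import mimetypes
from email.message import Message
from typing import Optional, Tuple


def guess_mime(filename: str) -> Optional[str]:
    """Guess a MIME type from the file name alone (no bytes are read)."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def split_version(content_type: str) -> Tuple[str, Optional[str]]:
    """Split e.g. 'application/pdf; version=1.4' into (type, version).

    The 'version' parameter is an illustrative assumption; the second
    element is None when no such parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_param("version")


for name in ("notes.txt", "index.html", "photo.jpg"):
    print(name, "->", guess_mime(name))

print(split_version("application/pdf; version=1.4"))
```

Note that the registered type for JPEG images is image/jpeg, which is what the lookup returns for a .jpg name.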

SCAPE Apple’s Alternatives

• Pre-OS X versions of Mac OS used Creator and Type codes
  • Creator: the software that created the file
  • Type: the type of information, e.g. TEXT
  • More flexible than extensions, but no longer used

• Recent OS X versions also use Uniform Type Identifiers

SCAPE PRONOM Unique Identifiers or PUIDs

• PRONOM is a web-based registry of file format information

• Created in 2002 and hosted by the National Archives of the UK

• Uses PUIDs to identify file formats:
  • fmt/15 == Acrobat PDF 1.1
  • fmt/16 == Acrobat PDF 1.2
  • fmt/17 == Acrobat PDF 1.3

SCAPE The Unix File Utility

• A standard Unix program for identifying the data in a file.

• First released in 1973; written in C, so it must be compiled for each operating system

• The open source version used in Linux distributions was written in 1986

• Identification is based upon compiled “magic” files

• Provides text information about files, or MIME types with the right options
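A minimal sketch of calling the utility from Python, assuming a `file` binary that accepts the --brief and --mime-type options is on the PATH. The wrapper name file_mime_type is hypothetical, and it returns None rather than failing when the program is missing.

```python
import shutil
import subprocess
from typing import Optional


def file_mime_type(path: str) -> Optional[str]:
    """Ask the system `file` command for a MIME type.

    Returns None if `file` is not installed, exits with an error,
    or prints nothing.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    import sys

    # Usage: python file_mime.py some_file [another_file ...]
    for arg in sys.argv[1:]:
        print(arg, "->", file_mime_type(arg))
```

Without --mime-type the same command prints a free-text description instead, which is easier to read but harder to process automatically.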

SCAPE FIDO

• Format Identification of Digital Objects
• Open Source format identification tool
• Based upon the PRONOM signature data compiled to regular expressions
• Written in Python so can be run on different Operating Systems
• Richer command line syntax than DROID

SCAPE Apache Tika

• Open Source toolkit for detecting and extracting metadata and structured text from files

• Performs Format Identification and deeper characterisation (more on that later).

• Java based so will run on different platforms.

• Returns MIME types as format identifiers

SCAPE How Do These Tools Identify Formats?

• They exploit “common features” of the format.
• PDF start of file:
  • %PDF-1.1 PDF Version 1.1
  • %PDF-1.2 PDF Version 1.2
  • %PDF-1.6 PDF Version 1.6

• Tika and File simply look for files starting with the string %PDF- and return the MIME type

• FIDO, however…
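The prefix test described for Tika and File can be sketched in a few lines of Python. This is an illustration of the idea only, not the code of either tool; the names PDF_MAGIC, looks_like_pdf and sniff_mime are hypothetical, and real tools consult large tables of such signatures.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """True if the byte stream starts with the PDF header string."""
    return data.startswith(PDF_MAGIC)


def sniff_mime(data: bytes) -> str:
    """Return a MIME type based only on the leading bytes."""
    if looks_like_pdf(data):
        return "application/pdf"
    # Generic fallback for anything this sketch does not recognise.
    return "application/octet-stream"


print(sniff_mime(b"%PDF-1.1\n..."))  # application/pdf
print(sniff_mime(b"%PDF-1.6\n..."))  # application/pdf (version is ignored)
print(sniff_mime(b"hello world"))    # application/octet-stream
```

Every PDF version gives the same answer here, because only the shared prefix is inspected.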

SCAPE FIDO & PDF Identification

• FIDO identifies the different PDF versions, each of which has a PUID

• FIDO also looks for an END OF FILE marker for PDFs: .%%EOF.

• This could be a problem…
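The slide does not spell out the problem, so the sketch below shows just one plausible reading: a stricter signature that requires both the header and an end-of-file marker rejects a file whose marker is not at the very end. The regular expressions are illustrative assumptions and are not FIDO's actual signatures; identify_pdf is a hypothetical helper, and the PUID table holds only the three versions listed on the PRONOM slide.

```python
import re
from typing import Optional

# PUIDs from the PRONOM slide; other PDF versions are deliberately left out.
PUID_BY_VERSION = {"1.1": "fmt/15", "1.2": "fmt/16", "1.3": "fmt/17"}

# Assumed header pattern: "%PDF-" plus a one-digit.one-digit version at offset 0.
HEADER = re.compile(rb"\A%PDF-(\d\.\d)")
# Assumed trailer pattern: "%%EOF" at the end, followed only by ASCII whitespace.
TRAILER = re.compile(rb"%%EOF\s*\Z")


def identify_pdf(data: bytes) -> Optional[str]:
    """Return a PUID if both header and trailer match, otherwise None."""
    header = HEADER.search(data)
    if header is None or TRAILER.search(data) is None:
        return None
    return PUID_BY_VERSION.get(header.group(1).decode("ascii"))


good = b"%PDF-1.2\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
padded = good + b"\x00" * 16           # same file with trailing padding bytes
truncated = good[: good.index(b"%%EOF")]  # same file cut before the marker

print(identify_pdf(good))       # fmt/16
print(identify_pdf(padded))     # None
print(identify_pdf(truncated))  # None
```

Under these assumptions, a prefix-only check would still call all three byte strings PDF, while the stricter signature identifies only the first.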
