characterisation - 101. an introduction to the identification and characterisation of file formats...



DESCRIPTION

This is an introduction to the identification and characterization of file formats and which tools can be used for this. The intro was given by Carl Wilson from Open Planets Foundation at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.

TRANSCRIPT

Page 1: Characterisation - 101. An introduction to the identification and characterisation of file formats – SCAPE Training event, Guimarães 2012

SCAPE

Carl Wilson Open Planets Foundation

SCAPE Training Guimarães

Characterisation - 101 An introduction to the identification and characterisation of file formats.

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

Page 2: About Us

• Carl Wilson Open Planets Foundation [email protected] http://www.openplanetsfoundation.org

• SCAPE Project EU funded research project SCAlable Preservation Environments http://www.scape-project.eu

Page 3: About You

• Once Around The Room
  • Name
  • Where you work
  • What you do
  • Why you’re here
• DO Ask Questions
  • Or tell me to slow down…
  • Or ask me to repeat something…

Page 4: File Formats

• What is a File Format?
  • A “standard” method of encoding data for storage.
  • May be to an open specification
  • OR a proprietary one, open preferred
  • Or simply following a loosely documented convention

Page 5: Who Cares About Formats?

• Operating Systems: in order to open a file with an application that can interpret/render it.

• Web Servers: to negotiate Content-Type in HTTP requests

• Memory Institutions: to identify software stacks that can render or extract meaning from a file, now or at a later date.

• More Generally: everyone with digital content, whether they know it or not.

Page 6: Some Uses of Format Information

• Format Information:
  • Associates a file with software that can interpret and/or render its contents
  • Can be used to find documentation / specifications to help interpret a file’s contents
  • Is a first step to preservation planning, knowing what you have…

Page 7: File Name Extension

• A file name suffix, separated from the file base name by a dot “.”
• Examples: .pdf, .txt, .jpg, .doc, .docx
• This has worked for a number of years BUT
  • Any user with the right permission can change a file extension
  • Bytes aren’t always transferred with a name
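A minimal sketch of why the extension alone is a weak identifier. The helper names `extension_of` and `first_bytes` and the file name `report.txt` are made up for this illustration; the point is only that the name and the bytes can disagree.

```python
import os
import tempfile


def extension_of(filename: str) -> str:
    """Return the lower-cased extension of a file name, including the dot."""
    return os.path.splitext(filename)[1].lower()


def first_bytes(path: str, count: int = 5) -> bytes:
    """Read the first few bytes of a file, ignoring its name entirely."""
    with open(path, "rb") as handle:
        return handle.read(count)


# Write PDF-looking bytes under a misleading ".txt" name (illustrative content only).
with tempfile.TemporaryDirectory() as folder:
    misleading = os.path.join(folder, "report.txt")
    with open(misleading, "wb") as handle:
        handle.write(b"%PDF-1.4\n")
    claimed = extension_of(misleading)  # what the name claims
    actual = first_bytes(misleading)    # what the bytes say

print(claimed, actual)
```

Here the name claims a text file while the content starts like a PDF, which is the situation a renamed file creates.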

Page 8: Internet Media (MIME) Types

• The format identifiers used by the web
• Examples:
  • text/plain
  • text/html
  • image/jpeg
• Don’t readily hold extra information such as format version, but may be extended.
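As a rough illustration of how a name is commonly mapped to a MIME type, the sketch below uses Python's standard `mimetypes` module. The wrapper `guess_mime` and the fallback to `application/octet-stream` are choices made for this example. Note that the lookup uses the file name only, so it inherits every weakness of extensions, and that the registered type for JPEG images is `image/jpeg`.

```python
import mimetypes


def guess_mime(filename: str) -> str:
    """Guess a MIME type from the file name alone (no bytes are read)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # Fall back to the generic binary type when the extension is unknown.
    return mime_type or "application/octet-stream"


for name in ("notes.txt", "index.html", "photo.jpg"):
    print(name, "->", guess_mime(name))
```

The lookup table can be extended by system configuration files, so results for less common extensions may differ between machines.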

Page 9: Apple’s Alternatives

• Pre-OS X versions of Mac OS used Creator and Type codes
  • Creator: The software that created the file
  • Type: The type of information, e.g. TEXT
  • More flexible than extension, but no longer used
• Recent OS X versions also use Uniform Type Identifiers

Page 10: PRONOM Unique Identifiers or PUIDs

• PRONOM is a web based registry of file format information

• Created and Hosted by the National Archives of the UK in 2002

• Uses PUIDs to identify file formats:
  • fmt/15 == Acrobat PDF 1.1
  • fmt/16 == Acrobat PDF 1.2
  • fmt/17 == Acrobat PDF 1.3
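The three PUIDs listed above can be held in a small lookup table. This is only a sketch: the names `PUID_NAMES`, `split_puid` and `describe` are invented for the example, and the table contains just the entries from this slide, not the PRONOM registry.

```python
# The three entries shown on the slide; the real registry is far larger.
PUID_NAMES = {
    "fmt/15": "Acrobat PDF 1.1",
    "fmt/16": "Acrobat PDF 1.2",
    "fmt/17": "Acrobat PDF 1.3",
}


def split_puid(puid: str) -> tuple[str, int]:
    """Split a PUID such as 'fmt/15' into its prefix and its number."""
    prefix, _, number = puid.partition("/")
    if not prefix or not number.isdigit():
        raise ValueError(f"not a PUID: {puid!r}")
    return prefix, int(number)


def describe(puid: str) -> str:
    """Look up a human-readable name, or report that the PUID is unknown here."""
    return PUID_NAMES.get(puid, "unknown PUID")


print(split_puid("fmt/15"), describe("fmt/15"))
```

Unlike a MIME type, each PUID here pins down a specific format version.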

Page 11: The Unix File Utility

• A standard Unix program for identifying the data in a file.
• First released in 1973; written in C, so it has to be compiled for each operating system
• The open source version used in Linux distributions was written in 1986
• Identification based upon compiled “magic” files
• Provides text information about files, or MIME types with the right options
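A small sketch of calling the utility from Python. It assumes the open source `file` implementation found on most Linux systems, whose `--brief` and `--mime-type` options suppress the file name and print a MIME type instead of a text description. The function names `file_command` and `run_file` are made up for this example.

```python
import shutil
import subprocess
from typing import List, Optional


def file_command(path: str, mime: bool = False) -> List[str]:
    """Build the argument list for the file utility."""
    command = ["file", "--brief"]
    if mime:
        command.append("--mime-type")
    # "--" stops option parsing, so odd file names are not read as options.
    command += ["--", path]
    return command


def run_file(path: str, mime: bool = False) -> Optional[str]:
    """Run file on a path; return None when the utility is not installed."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        file_command(path, mime), capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


print(file_command("example.pdf", mime=True))
```

Calling `run_file(path)` gives the text description and `run_file(path, mime=True)` the MIME type, provided the utility is available on the machine.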

Page 12: FIDO

• Format Identification of Digital Objects
• Open Source format identification tool
• Based upon the PRONOM signature data compiled to regular expressions
• Written in Python so can be run on different Operating Systems
• Richer command line syntax than DROID

Page 13: Apache Tika

• Open Source toolkit for detecting and extracting metadata and structured text from files

• Performs Format Identification and deeper characterisation (more on that later).

• Java based so will run on different platforms.

• Returns MIME types as format identifiers

Page 14: How Do These Tools Identify Formats?

• They exploit “common features” of the format.
• PDF start of file:
  • %PDF-1.1 PDF Version 1.1
  • %PDF-1.2 PDF Version 1.2
  • %PDF-1.6 PDF Version 1.6
• Tika and File simply look for files starting with the string %PDF- and return the MIME type
• FIDO, however…
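The start-of-file check described above can be sketched in a few lines. This is an illustration of the idea only, not the code used by Tika or File; the names `PDF_MAGIC` and `sniff_mime` and the generic fallback type are assumptions made for the example.

```python
PDF_MAGIC = b"%PDF-"


def sniff_mime(data: bytes) -> str:
    """Return a MIME type based only on how the byte stream starts."""
    if data.startswith(PDF_MAGIC):
        return "application/pdf"
    # Anything unrecognised is reported as generic binary data.
    return "application/octet-stream"


print(sniff_mime(b"%PDF-1.6\n..."))
print(sniff_mime(b"plain text"))
```

Every PDF version shares the same prefix, so this check returns the same MIME type for all of them and says nothing about the version.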

Page 15: FIDO & PDF Identification

• FIDO identifies the different PDF versions, each of which has a PUID
• FIDO also looks for an END OF FILE marker for PDFs: .%%EOF.
• This could be a problem…
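A hedged sketch of version-aware matching with regular expressions. The patterns below are written for this example and are not FIDO's actual signatures; the version-to-PUID table uses only the three PUIDs given earlier in the slides. It shows one way a strict end-of-file requirement could become a problem: a file that is truncated, or has extra bytes after the marker, no longer matches.

```python
import re
from typing import Optional

# Illustrative patterns, not taken from PRONOM or FIDO.
PDF_HEADER = re.compile(rb"\A%PDF-1\.([0-9])")
PDF_EOF = re.compile(rb"%%EOF\s*\Z")

# Minor version -> PUID, limited to the entries shown in the slides.
VERSION_TO_PUID = {1: "fmt/15", 2: "fmt/16", 3: "fmt/17"}


def identify_pdf(data: bytes) -> Optional[str]:
    """Return a PUID when both header and end marker match, else None."""
    header = PDF_HEADER.search(data)
    if header is None or PDF_EOF.search(data) is None:
        return None
    # Versions missing from the small table also come back as None.
    return VERSION_TO_PUID.get(int(header.group(1)))


complete = b"%PDF-1.2\nbody\n%%EOF\n"
print(identify_pdf(complete))
print(identify_pdf(complete + b"trailing bytes"))
```

The second call returns `None` even though the file still starts with `%PDF-1.2`, so a prefix-only check would accept it while the stricter match rejects it.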
