Page 1: Characterisation - 101. An introduction to the identification and characterisation of file formats – SCAPE Training event, Guimarães 2012

SCAPE

Carl Wilson, Open Planets Foundation

SCAPE Training Guimarães

Characterisation - 101 An introduction to the identification and characterisation of file formats.

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

Page 2:

About Us

• Carl Wilson, Open Planets Foundation
  [email protected]
  http://www.openplanetsfoundation.org

• SCAPE Project: EU-funded research project, SCAlable Preservation Environments
  http://www.scape-project.eu

Page 3:

About You

• Once around the room:
  • Name
  • Where you work
  • What you do
  • Why you’re here

• DO ask questions
  • Or tell me to slow down…
  • Or ask me to repeat something…

Page 4:

File Formats

• What is a file format?
  • A “standard” method of encoding data for storage.
  • May follow an open specification
  • Or a proprietary one (open is preferred)
  • Or simply a loosely documented convention

Page 5:

Who Cares About Formats?

• Operating systems: in order to open a file with an application that can interpret or render it.

• Web servers: to set and negotiate the Content-Type in HTTP requests and responses.

• Memory institutions: to identify software stacks that can render or extract meaning from a file, now or at a later date.

• More generally: everyone with digital content, whether they know it or not.

Page 6:

Some Uses of Format Information

• Format information:
  • Associates a file with software that can interpret and/or render its contents
  • Can be used to find documentation / specifications to help interpret a file’s contents
  • Is a first step in preservation planning: knowing what you have

Page 7:

File Name Extension

• A file name suffix, separated from the file base name by a dot “.”.
• Examples: .pdf, .txt, .jpg, .doc, .docx
• This has worked for a number of years BUT
  • Any user with the right permission can change a file extension
  • Bytes aren’t always transferred with a name
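
A minimal Python sketch of the first weakness above: renaming a file changes its extension but not its bytes. The helper name extension_of and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def extension_of(name: str) -> str:
    """Return the lower-cased extension (including the dot), or '' if there is none."""
    return Path(name).suffix.lower()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.pdf"
    original.write_bytes(b"%PDF-1.4\n")  # content that starts like a PDF

    # Anyone with write permission can do this; the content is untouched.
    renamed = original.rename(Path(tmp) / "report.txt")

    print(extension_of(renamed.name))  # the extension now claims plain text
    print(renamed.read_bytes()[:5])    # the bytes still begin with the PDF marker
```

An identifier that trusts only the name would report this file as text, which is why the tools on the later slides look at the bytes instead.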


Page 8:

Internet Media (MIME) Types

• The format identifiers used by the web
• Examples:
  • text/plain
  • text/html
  • image/jpeg
• Don’t readily hold extra information such as format version, but may be extended.
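
As a small illustration, Python's standard mimetypes module maps a file name to a MIME type. It looks only at the extension, so it inherits the extension's weaknesses and returns no version information. The wrapper name guess_mime and the file names are illustrative.

```python
import mimetypes
from typing import Optional


def guess_mime(name: str) -> Optional[str]:
    """Guess a MIME type from the file name alone; None if the extension is unknown."""
    mime, _encoding = mimetypes.guess_type(name)
    return mime


for name in ("notes.txt", "index.html", "photo.jpg"):
    print(name, "->", guess_mime(name))
```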


Page 9:

Apple’s Alternatives

• Pre-OS X versions of Mac OS used Creator and Type codes
  • Creator: the software that created the file
  • Type: the type of information, e.g. TEXT
  • More flexible than extensions, but no longer used
• Recent OS X versions also use Uniform Type Identifiers

Page 10:

PRONOM Unique Identifiers or PUIDs

• PRONOM is a web-based registry of file format information
• Created in 2002 and hosted by The National Archives of the UK
• Uses PUIDs to identify file formats:
  • fmt/15 == Acrobat PDF 1.1
  • fmt/16 == Acrobat PDF 1.2
  • fmt/17 == Acrobat PDF 1.3
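
A toy sketch of how PUIDs work as lookup keys, using only the three entries quoted above. The real registry is far larger; the table and function names here are invented for illustration and are not part of any PRONOM tool.

```python
from typing import List, Optional

# Only the three PUIDs listed on the slide; not a copy of the PRONOM registry.
PUID_NAMES = {
    "fmt/15": "Acrobat PDF 1.1",
    "fmt/16": "Acrobat PDF 1.2",
    "fmt/17": "Acrobat PDF 1.3",
}


def describe(puid: str) -> Optional[str]:
    """Return the format name for a PUID, or None if it is not in this toy table."""
    return PUID_NAMES.get(puid)


def puids_for(name_fragment: str) -> List[str]:
    """Return every PUID whose format name contains the given text (case-insensitive)."""
    wanted = name_fragment.lower()
    return sorted(p for p, name in PUID_NAMES.items() if wanted in name.lower())


print(describe("fmt/16"))
print(puids_for("pdf"))
```

Unlike a MIME type, each PUID here pins down one specific version of the format.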


Page 11:

The Unix File Utility

• A standard Unix program for identifying the data in a file.
• First released in 1973; written in C, so it requires operating-system-dependent compilation
• The open source version used in Linux distributions was written in 1986
• Identification is based upon compiled “magic” files
• Provides text information about files, or MIME types with the right options
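
A hedged sketch of calling the file utility from Python to get a MIME type. It assumes the open source file command is on the PATH and uses its --brief and --mime-type options; the wrapper name file_mime_type is made up, and it returns None when file is not installed.

```python
import shutil
import subprocess
from typing import Optional


def file_mime_type(path: str) -> Optional[str]:
    """Ask the `file` utility for a MIME type; None if `file` is not available."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        # --brief drops the file name from the output; "--" ends option parsing.
        ["file", "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Without --mime-type the same command prints a free-text description instead, which is easier to read but harder to process automatically.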

Page 12:

FIDO

• Format Identification for Digital Objects
• Open source format identification tool
• Based upon the PRONOM signature data, compiled to regular expressions
• Written in Python, so it can be run on different operating systems
• Richer command line syntax than DROID

Page 13:

Apache Tika

• Open source toolkit for detecting and extracting metadata and structured text from files
• Performs format identification and deeper characterisation (more on that later).
• Java-based, so it will run on different platforms.
• Returns MIME types as format identifiers

Page 14:

How Do These Tools Identify Formats?

• They exploit “common features” of the format.
• PDF start of file:
  • %PDF-1.1  PDF Version 1.1
  • %PDF-1.2  PDF Version 1.2
  • %PDF-1.6  PDF Version 1.6
• Tika and File simply look for files starting with the string %PDF- and return the MIME type
• FIDO, however……
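
A minimal sketch of the header check described above. It is not the actual code of Tika or File, only an illustration of the idea; the function names and sample bytes are made up.

```python
import re
from typing import Optional

# "%PDF-" followed by a major.minor version, anchored at the start of the data.
PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def is_pdf(data: bytes) -> bool:
    """The simple test: does the data start with the string %PDF- ?"""
    return data.startswith(b"%PDF-")


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version quoted in the header (e.g. '1.6'), or None if absent."""
    match = PDF_HEADER.match(data)
    return match.group(1).decode("ascii") if match else None


sample = b"%PDF-1.6\n1 0 obj\nendobj\n%%EOF\n"
print(is_pdf(sample), pdf_header_version(sample))
print(is_pdf(b"GIF89a"), pdf_header_version(b"GIF89a"))
```

The first function is enough to return a MIME type; the second shows the extra step needed to tell versions apart.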


Page 15:

FIDO & PDF Identification

• FIDO identifies the different PDF versions, each of which has a PUID
• FIDO also looks for an END OF FILE marker for PDFs: .%%EOF.
• This could be a problem…
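
The slide does not say what the problem is. One plausible reading, offered here as an assumption, is that a file with extra bytes after the %%EOF marker fails a rule that expects the marker at the very end. The sketch below is not FIDO's real signature: the patterns, the function name and the three-entry PUID table (taken from the PRONOM slide) are illustrative only.

```python
import re
from typing import Optional

HEADER = re.compile(rb"\A%PDF-(1\.[0-9])")
# Hypothetical strict rule: %%EOF must be the last thing in the file,
# apart from trailing whitespace such as a line ending.
EOF_AT_END = re.compile(rb"%%EOF\s*\Z")

# Only the PUIDs quoted earlier in the slides.
PUIDS = {"1.1": "fmt/15", "1.2": "fmt/16", "1.3": "fmt/17"}


def identify_pdf(data: bytes, require_eof: bool = True) -> Optional[str]:
    """Return a PUID for the PDF version, or None if the rules do not match."""
    header = HEADER.match(data)
    if header is None:
        return None
    if require_eof and EOF_AT_END.search(data) is None:
        return None
    # Versions outside the toy table also yield None.
    return PUIDS.get(header.group(1).decode("ascii"))


good = b"%PDF-1.3\n1 0 obj\nendobj\n%%EOF\n"
padded = good + b"\x00" * 8  # e.g. padding added during transfer

print(identify_pdf(good))                        # matched by both rules
print(identify_pdf(padded))                      # strict rule rejects it
print(identify_pdf(padded, require_eof=False))   # header-only rule accepts it
```

The stricter rule gives more precise answers when it matches, at the cost of missing files that a header-only check would still recognise.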
