Attribute Grammars for Validating
Chunk-based Binary File Formats
William Underwood
Sandra Laib
ICL/ITDSD Working Paper 11-03
July 2011
Information Technology and Decision Support Division
Information and Communications Laboratory
Georgia Tech Research Institute
Atlanta, Georgia
Research was sponsored by the Army Research Laboratory (ARL) and the Applied Research
Division of the National Archives and Records Administration (NARA) and was accomplished
under Cooperative Agreement Number W911NF-10-2-0030. The views and conclusions
contained in this document are those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Army Research Laboratory, NARA, or
the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for
Government purposes notwithstanding any copyright notation hereon.
ABSTRACT
The capability to validate and view or play binary file formats, as well as to convert binary file
formats to standard or current file formats, is critically important to the preservation of digital
data and records.
Context-free grammars are used for defining programming languages. There are established
methods for using these grammars to check the syntax of programs and to compile programs into
assembly or machine language programs. Grammars have also been used to define page
description languages and vector graphics languages. However, current technologies for
context-free grammars, syntax checkers, and compilers are limited to string languages or string
data types.
This report describes the extension of context-free grammars from strings to binary files. Binary
files are arrays of data types such as long and short integers, floating-point numbers, and pointers,
as well as characters. The concept of an attribute grammar is extended to these context-free array
grammars. Attribute grammars have been used to define a number of chunk-based file formats.
A parser generator has been used with some of these grammars to generate syntax checkers
(recognizers) for validating binary file formats. The significance of these results is that, with
these extensions to core computer science concepts, traditional parser/compiler technologies can
be used as part of a cost-effective preservation strategy for binary file formats.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 BACKGROUND
1.2 PURPOSE
1.3 SCOPE
2 BINARY FILE FORMATS
3 CONTEXT-FREE GRAMMARS AND ATTRIBUTE GRAMMARS
3.1 CONTEXT-FREE GRAMMARS FOR ARRAYS OF DATA TYPES
3.2 ATTRIBUTE GRAMMARS
3.3 SEMANTIC GRAMMAR
4 GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS
4.1 INTERLEAVED BITMAP (ILBM)
4.2 RIFF WAVEFORM PCM AUDIO FILE FORMAT (WAVE)
5 RECURSIVE DESCENT PARSERS FOR BINARY FILE GRAMMARS
6 A PARSER GENERATOR FOR RECOGNIZING BINARY FILE FORMATS
6.1 ANTLR
6.2 USING ANTLR TO GENERATE A PARSER FOR A BINARY FILE FORMAT
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
6.2.2 The Lexer and Parser
6.3 AN ANTLR GRAMMAR FOR THE RIFF WAVEFORM PCM FILE FORMAT
7 RELATED RESEARCH
8 CONCLUSION
REFERENCES
APPENDIX A: CHUNK-BASED BINARY FILE FORMATS
A.1 ELECTRONIC ARTS INTERCHANGE FILE FORMAT (IFF)
A.1.1 Interleaved Bitmap (ILBM)
A.1.2 8-bit Sampled Voice (8SVX)
A.1.3 Cel Animation (ANIM)
A.1.4 IFF Formatted Text (FTXT)
A.1.5 IFF Simple Musical Score (SMUS)
A.1.6 Planar Bitmap (PBM)
A.2 STANDARD MIDI FILE FORMAT
A.3 AUDIO INTERCHANGE FILE FORMAT
A.4 RESOURCE INTERCHANGE FILE FORMATS
A.4.1 RIFF-Waveform (WAVE)
A.4.2 WAVEFORMATEX
A.4.3 WAVEFORMATEXTENSIBLE
A.4.4 Broadcast Wave Format
A.4.5 Windows Audio-Video Interleave (AVI)
A.4.6 Animated Windows Cursor (ANI)
A.4.7 RIFF MIDIfile (RMID)
A.4.8 Device-Independent Bitmap (RIFF RDIB)
A.4.9 WebP
A.5 JPEG
A.5.1 JPEG/JFIF
A.5.2 JPEG/Exif
A.5.3 JPEG 2000
A.5.4 JPEG 2000 Codestream
A.6 ADVANCED SYSTEMS FORMAT
A.6.1 Windows Media Audio 7-9
A.6.2 Windows Media Video 7-91
A.7 PORTABLE NETWORK GRAPHICS
A.7.1 Multiple-image Network Graphics
A.7.2 JPEG Network Graphics
A.8 BINARY INTERCHANGE FILE FORMAT
A.9 3D STUDIO
A.9.1 3D Studio File Format - 3ds
A.9.2 3D Studio Material Library Files
A.10 ANIM8OR
A.11 ANIMATOR
A.11.1 Animator Cel
A.11.2 Animator Pro Fli
A.11.3 Animator Pro fLC
A.11.4 Animator Pro PIC
A.12 AUDIO MANAGER MODULE
A.13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY)
A.14 CALIGARI TRUESPACE SCENE/OBJECT (BINARY)
A.15 CORELDRAW VECTOR GRAPHICS FILE
A.16 COREL METAFILE EXCHANGE IMAGE
A.17 MUSIC MODULES
A.17.1 DigiTrekker Module Format
A.17.2 Graoumf Tracker modules
A.17.3 Oktalyzer
A.18 EA MULTIMEDIA GAME FILE FORMATS
A.18.1 Electronic Arts Music (asf)
A.18.2 Electronic Arts CMV
A.18.3 Electronic Arts DCT
A.18.4 Electronic Arts MAD
A.18.5 Electronic Arts MPC
A.18.6 Electronic Arts VP7
A.18.7 Electronic Arts TGV
A.18.8 Electronic Arts TGQ
A.18.9 Electronic Arts TQI
A.19 IMAGINE 3D DATA DESCRIPTION (TDDD)
A.20 LIGHTWAVE 3D OBJECTS
A.20.1 LWOB
A.20.2 LWO2
A.20.3 LWLO
A.21 LUXOLOGY MODO 3D OBJECT (LXOB)
A.22 PAINT SHOP PRO (PSP)
A.23 PROVECTOR 2D DRAWING
A.24 PROWRITE DOCUMENT
A.25 INTERACTIVE FICTION
A.25.1 Blorb Z-machine Resource File
A.25.2 Quetzal
A.26 APPLE QUICKTIME
A.27 MAYA IFF BITMAP
A.28 APPLE CORE AUDIO FORMAT
A.29 AMERICAN LASER GAMES (MM)
A.30 SMUSH ANIMATION FORMAT
A.30.1 SAN
A.30.2 SNM
A.31 STRUCTURED DATA EXCHANGE FORMAT (SDXF)
A.32 OGG VORBIS
A.33 SIMPLE VECTOR ARRAY FORMAT
A.34 NESTED LIST FILE
A.35 CINEMA 4D
A.36 THE SIMS INTERCHANGE FILE FORMAT
List of Figures
Figure 1. File format (layout) of a RIFF container for a WebP image
Figure 2. ILBM File ham.pic Displayed
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4. A Parse Tree for the ILBM File ham.pic
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
Figure 6. An ANTLR Grammar for the ILBM File Format
Figure 7. A Java Application for Parsing an ILBM File
Figure 8. Output of the ILBM Parser Applied to the file BLK.IFF
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
1 Introduction
1.1 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives. Automated tools are also needed for
viewing or playing text and binary file formats and for converting legacy and obsolete file formats
to standard or current formats. Technologies such as data description languages have emerged to
address these preservation challenges [Dunckley et al. 2007].
Context-free grammars have been used to specify the syntax of programming languages.
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar and for addressing the mildly context-sensitive features of
programming languages, such as agreement of the data types of variables in expressions with
data type declarations. Such grammars have parsing/translation algorithms that can be used for
syntax checking and for interpretation or translation of the languages the grammars define.
1.2 Purpose
The question addressed by this research is whether it is possible to extend the context-free
grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the formats of
binary files.
1.3 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats. In section 3, extensions to the concepts of context-free
grammars and attribute grammars are described that enable the specification of binary file
formats. In section 4, examples of attribute array grammars for chunk-based binary file formats
are presented. In section 5, recursive descent parsers for these classes of grammars are described.
In section 6, experience in using ANTLR, a parser generator for LL(*) string grammars, in
generating parsers for recognizing the formats of binary files is discussed. In section 7, related
research is described. Section 8 summarizes the results.
2 Binary File Formats
Traditionally, a file (or record) layout was a form showing how fields were positioned in a file
(or record). The fields are named and have as attributes a data type, a length, and sometimes a
constant value. Fields are located at offsets relative to the beginning of the file. File formats are
still specified in this manner. For instance, Figure 1 shows the layout of a RIFF container
for the VP8 codec of a WebP image [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1. File format (layout) of a RIFF container for a WebP image
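A fixed-offset layout of this kind can be checked mechanically. The following is a minimal Java sketch and is not code from this project: the class WebpLayoutCheck and its methods are hypothetical names, and the sketch assumes the little-endian size fields used by RIFF and that the outer size of a well-formed file equals the file length minus 8.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that checks the fixed-offset layout listed in Figure 1. */
final class WebpLayoutCheck {

    /** Reads the 4-byte ASCII tag stored at the given offset. */
    static String tagAt(byte[] file, int offset) {
        return new String(file, offset, 4, StandardCharsets.US_ASCII);
    }

    /** Reads an unsigned 32-bit little-endian size stored at the given offset. */
    static long sizeAt(byte[] file, int offset) {
        int raw = ByteBuffer.wrap(file, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return raw & 0xFFFFFFFFL;
    }

    /** Returns true if the three tags and both size fields agree with the layout. */
    static boolean matchesLayout(byte[] file) {
        if (file.length < 20) return false;               // header fields end at offset 20
        if (!tagAt(file, 0).equals("RIFF")) return false;
        if (!tagAt(file, 8).equals("WEBP")) return false;
        if (!tagAt(file, 12).equals("VP8 ")) return false;
        long riffSize = sizeAt(file, 4);                  // counts bytes from offset 8
        long vp8Size = sizeAt(file, 16);                  // counts bytes from offset 20
        return riffSize == file.length - 8 && vp8Size <= file.length - 20;
    }
}
```

The two size comparisons are exactly the kind of relationship between a stored size and the actual data length that a plain layout table states only in prose.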
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human-readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not actually expressible in the EBNF (or regular
expression) notation. These details are often the context-sensitive relationships of the declared
size of an array to its actual length, or the relationship of an address pointer to the actual location
of the data pointed to in the file. These context-sensitive relationships are not expressible in a
context-free grammar (or EBNF notation). It is a goal of this research to extend the EBNF
notation and the concept of a context-free grammar to include the details that are necessary to
precisely specify binary file formats that include these context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats. There
are two families of binary file formats that are readily distinguished: the chunk-based and the
directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. An IFF file itself is one entire IFF chunk. A
chunk consists of an ID tag, a size, and size bytes of data. The data may include chunks, called
subchunks. Subchunks have the same structure as chunks, and subchunks can themselves have
subchunks. Chunk-based file formats are primarily used as containers for multimedia, but have
also been used for word processing files including pictures and other figures. File format
specifications for chunk-based formats may use terms other than chunks in describing the format,
for instance, atoms in QuickTime/MP4, segments in JPEG, and "tagged data representations" in
AutoCAD DXF.
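The chunk structure just described (ID tag, size, size bytes of data, possibly containing subchunks) can be walked with a few lines of code. The Java sketch below is illustrative only. The class IffChunkLister is a hypothetical name; the sketch assumes the big-endian 32-bit size fields and the pad byte after odd-sized chunk data that the IFF specification prescribes, neither of which is spelled out in the text above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: lists the chunks of an IFF file as "ID:size" strings. */
final class IffChunkLister {

    /** Lists every chunk found between the buffer's position and its limit. */
    static List<String> listChunks(ByteBuffer buf) {
        List<String> found = new ArrayList<>();
        while (buf.remaining() >= 8) {
            byte[] id = new byte[4];
            buf.get(id);                                   // ID tag
            long size = buf.getInt() & 0xFFFFFFFFL;        // ByteBuffer is big-endian by default
            if (size > buf.remaining()) {
                throw new IllegalArgumentException("chunk size exceeds remaining data");
            }
            found.add(new String(id, StandardCharsets.US_ASCII) + ":" + size);
            long skip = size + (size & 1);                 // assumed: odd-sized data has a pad byte
            buf.position(buf.position() + (int) Math.min(skip, buf.remaining()));
        }
        return found;
    }

    /** Treats the file as one FORM chunk and lists the subchunks nested in its data. */
    static List<String> listSubchunks(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        if (buf.remaining() < 12) {
            throw new IllegalArgumentException("too short for a FORM chunk");
        }
        byte[] id = new byte[4];
        buf.get(id);
        long size = buf.getInt() & 0xFFFFFFFFL;
        boolean isForm = "FORM".equals(new String(id, StandardCharsets.US_ASCII));
        if (!isForm || size < 4 || size > buf.remaining()) {
            throw new IllegalArgumentException("not a well-formed FORM chunk");
        }
        buf.limit(buf.position() + (int) size);            // subchunks must stay inside the data
        buf.position(buf.position() + 4);                  // skip the form type, e.g. ILBM
        return listChunks(buf);
    }
}
```

Because subchunks have the same structure as chunks, the same listChunks loop serves at every nesting level; only the limit of the buffer changes.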
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format consists of
one or more directory tables, which contain one or more directory entries. A directory entry
specifies where the actual data for a type of information is located. This is the scheme used in
TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument, and
Microsoft Open Office files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar G is a quadruple <N, Σ, S, P>, where
    N is a finite set of non-terminal symbols,
    Σ is a finite set of terminal symbols,
    S ∈ N is the start symbol, and
    P is a set of production rules of the form N → (N ∪ Σ)*.
The set of all strings over a vocabulary Σ is denoted by Σ*. The language generated by a
context-free grammar G is the set L(G) = { w | w ∈ Σ* and S ⇒* w }.
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 four-byte words at file addresses 0, 4, 8, ..., 36, so that the element with index i has
the address 4 × i.
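The address computation can be written out directly. The short Java sketch below is only an illustration; the class ArrayAddressing is a hypothetical name, and the sketch assumes the elements are stored as big-endian 32-bit integers starting at file address 0.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of index-to-address arithmetic for a file viewed as an array. */
final class ArrayAddressing {

    /** Address of element 'index' when elements of 'elementSize' bytes start at 'base'. */
    static long addressOf(long base, int elementSize, int index) {
        return base + (long) elementSize * index;
    }

    /** Reads the 32-bit element with the given index from a file stored at address 0. */
    static int readInt32(byte[] file, int index) {
        // Absolute read at byte offset 4 * index; ByteBuffer defaults to big-endian
        return ByteBuffer.wrap(file).getInt(4 * index);
    }
}
```

For the ten-element example above, addressOf(0, 4, 9) gives 36, the address of the last word.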
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) are one-dimensional arrays of data types whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, and ptr. We define a binary file to be an
array of values of data of various types.
We define a context-free binary file grammar BG as a quintuple <N, D, Σ, S, P>, where
    N is a finite set of non-terminal symbols,
    D is a set of data types,
    Σ is a finite set of binary values of the data types D, called terminals,
    S ∈ N is the start symbol, and
    P is a set of production rules of the form N → (N ∪ Σ)*.
Let DataTypes denote the union of all the values of all data types in D. The set of all binary files
is BinaryFiles = [Indices → DataTypes], where Indices is the set of array indices, starting at 1.
The language generated by a context-free array (binary file) grammar BG is the set
L(BG) = { w | w ∈ BinaryFiles and S ⇒* w }.
By primitive nonterminal is meant a symbol that would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions that have only one or more
terminal symbols on the right. We adopt the notation shown below for primitive nonterminals:
    <primitive_nonterminal_symbol type=datatype_name constant=value>
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used, or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1968, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar GA is a triple <G, A, AR>, where G is a context-free grammar for the
language, A associates each grammar symbol X ∈ (N ∪ Σ) with a set of attributes, and AR
associates each production R ∈ P with a set of attribute computation rules. A(X), where
X ∈ (N ∪ Σ), can be further partitioned into two sets: synthesized attributes S(X) and inherited
attributes I(X). AR(R), where R ∈ P, contains rules for computing the inherited and synthesized
attributes associated with the symbols in the production R. Synthesized attributes are evaluated
and passed up from deeper terminals and non-terminals, while inherited attributes are passed
down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
    i) synthesized attributes, and
    ii) inherited attributes, where each inherited attribute of a symbol on the right-hand side
        of a production depends only on
        a) attributes of the sibling nodes to its left, and
        b) inherited attributes of the parent node (the left-hand side of the production).
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
3.3 Semantic Grammar
An interpreter for a binary file format that uses a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialogue systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993, Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
    [SHOW Show me] [FLIGHTS flights] [ORIGIN from Boston] [DESTINATION to San Francisco]
    [DEPART_DATE on Tuesday] [DEPART_TIME morning]
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammars that we present are a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986]. In this section, the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, ham.pic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. As described at the start of this section, the attribute grammar
mixes synthesized and inherited rules. It is also an L-attributed grammar. Figure 4 shows a parse
tree for the file ham.pic.
Figure 2. ILBM File ham.pic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
    ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
    <height type=UINT16>
    <xposition type=INT16>
    <yposition type=INT16>
    <nplanes type=BYTE>
    <masking type=BYTE>
    <compression type=BYTE>
    <reserved type=BYTE>
    <transparentcolor type=UINT16>
    <xaspect type=BYTE>
    <yaspect type=BYTE>
    <pagewidth type=INT16>
    <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]
    n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
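The CMAP production in Figure 3 shows how an attribute computed from an earlier field controls the parse of a later one: the value of cksize is synthesized from the bytes read, n is computed from it, and n is then handed down to the repeated color nonterminal. The Java sketch below illustrates that single rule. It is not the authors' implementation; the class CmapRule and its methods are hypothetical, and it assumes big-endian size fields as in IFF.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the CMAP production with its attribute rule n = cksize.value / 3. */
final class CmapRule {

    /** Parses one CMAP chunk at the buffer's position; returns n rows of {red, green, blue}. */
    static int[][] parseCmap(ByteBuffer buf) {
        byte[] id = new byte[4];
        buf.get(id);
        if (!"CMAP".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("expected CMAP tag");
        }
        long cksize = buf.getInt() & 0xFFFFFFFFL;          // synthesized attribute cksize.value
        if (cksize % 3 != 0 || cksize > buf.remaining()) {
            throw new IllegalArgumentException("CMAP size is not a usable multiple of 3");
        }
        int n = (int) (cksize / 3);                        // attribute rule from Figure 3
        int[][] colors = new int[n][];
        for (int i = 0; i < n; i++) {
            colors[i] = parseColor(buf);                   // n bounds the repetition of <color>
        }
        return colors;
    }

    /** color -> red green blue, each an unsigned BYTE. */
    static int[] parseColor(ByteBuffer buf) {
        return new int[] { buf.get() & 0xFF, buf.get() & 0xFF, buf.get() & 0xFF };
    }
}
```

Since n depends only on a symbol to its left in the production, the rule can be evaluated during a single left-to-right pass, which is the property that makes the grammar L-attributed.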
Figure 4. A Parse Tree for the ILBM File ham.pic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
                   <fmt-ck>
                   [<fact-ck>]
                   [<cue-ck>]
                   [<playlist-ck>]
                   [<assoc-data-list>]
                   <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' { <data-ck> | <silence-ck> }...
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
                   <dwPosition:DWORD>
                   <fccChunk:FOURCC>
                   <dwChunkStart:DWORD>
                   <dwBlockStart:DWORD>
                   <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                   <dwLength:DWORD>
                   <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
                   <dwSampleLength:DWORD>
                   <dwPurpose:DWORD>
                   <wCountry:WORD>
                   <wLanguage:WORD>
                   <wDialect:WORD>
                   <wCodePage:WORD>
                   <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
                   <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
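The fmt-ck production in Figure 5 combines typed primitive nonterminals with one constant value (wFormatTag must be 0x0001 for PCM). A procedure for that single production might look like the Java sketch below. The class WaveFmtRule is a hypothetical name, and the sketch assumes the little-endian byte order used by RIFF; it returns wBitsPerSample only so that a caller has something to inspect.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the production fmt-ck -> 'fmt ' cksize common-fields wBitsPerSample. */
final class WaveFmtRule {

    /** Parses one fmt chunk at the buffer's position and returns wBitsPerSample. */
    static int parseFmtChunk(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);                // RIFF stores integers little-endian
        byte[] id = new byte[4];
        buf.get(id);
        if (!"fmt ".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("expected 'fmt ' tag");
        }
        long cksize = buf.getInt() & 0xFFFFFFFFL;
        if (cksize < 16 || cksize > buf.remaining()) {     // the six fields below occupy 16 bytes
            throw new IllegalArgumentException("fmt chunk too small or truncated");
        }
        int wFormatTag = buf.getShort() & 0xFFFF;          // WORD with constant value 0x0001
        if (wFormatTag != 0x0001) {
            throw new IllegalArgumentException("wFormatTag is not PCM (0x0001)");
        }
        int wChannels = buf.getShort() & 0xFFFF;           // WORD
        long dwSamplesPerSec = buf.getInt() & 0xFFFFFFFFL; // DWORD
        long dwAvgBytesPerSec = buf.getInt() & 0xFFFFFFFFL;
        int wBlockAlign = buf.getShort() & 0xFFFF;
        int wBitsPerSample = buf.getShort() & 0xFFFF;
        return wBitsPerSample;
    }
}
```

The number of bytes consumed for each field is determined entirely by the data type named in the grammar, which is why no separate tokenizing step is needed.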
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing the production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
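One possible reading of the pointer-handling scheme is sketched below in Java. The format parsed here is invented purely for illustration (a 4-byte pointer, a 2-byte trailer equal to 0xFFFF, and a pointed-to block consisting of a 2-byte count followed by that many bytes); it is not one of the formats discussed in this paper, and the class PointerFollowingParser is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: following a file pointer with a pushdown stack of return addresses. */
final class PointerFollowingParser {
    private final ByteBuffer buf;
    private final Deque<Integer> returnAddresses = new ArrayDeque<>();

    PointerFollowingParser(byte[] file) {
        this.buf = ByteBuffer.wrap(file);
    }

    /** record -> pointer-to-block trailer; returns the block's byte count, or -1 if invalid. */
    int parseRecord() {
        if (buf.remaining() < 6) return -1;                // pointer (4 bytes) + trailer (2 bytes)
        int target = buf.getInt();                         // pointer field: absolute file address
        if (target < 0 || target > buf.limit() - 2) return -1;
        returnAddresses.push(buf.position());              // remember where parsing must resume
        buf.position(target);                              // jump to the pointed-to location
        int length = parseBlock();
        buf.position(returnAddresses.pop());               // resume just after the pointer field
        int trailer = buf.getShort() & 0xFFFF;
        return (length >= 0 && trailer == 0xFFFF) ? length : -1;
    }

    /** block -> count:uint16 byte[count]; returns count, or -1 if the block is truncated. */
    private int parseBlock() {
        int count = buf.getShort() & 0xFFFF;
        if (count > buf.remaining()) return -1;
        buf.position(buf.position() + count);
        return count;
    }
}
```

A stack rather than a single saved address is used so that pointed-to structures may themselves contain pointers.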
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
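To make the idea of a recognizer concrete, the Java sketch below hand-codes a recursive descent recognizer with one procedure per production. It covers only a reduced, hypothetical subset of the ILBM grammar (FORM, BMHD, and BODY, with no CMAP or CAMG chunk), assumes big-endian fields as in IFF, and uses the invented class name MiniIlbmRecognizer. It is not the parser produced by the work described here.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical recognizer for the subset grammar ILBM -> "FORM" cksize "ILBM" BMHD BODY. */
final class MiniIlbmRecognizer {
    private final ByteBuffer in;

    private MiniIlbmRecognizer(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** Returns true if the file conforms to the subset grammar, false otherwise. */
    static boolean recognize(byte[] file) {
        try {
            new MiniIlbmRecognizer(file).ilbm();
            return true;
        } catch (IllegalStateException | BufferUnderflowException e) {
            return false;
        }
    }

    // ILBM -> "FORM" cksize "ILBM" BMHD BODY
    private void ilbm() {
        expectTag("FORM");
        if (uint32() != in.remaining()) fail("FORM size does not match file length");
        expectTag("ILBM");
        bmhd();
        body();
        if (in.hasRemaining()) fail("unexpected trailing bytes");
    }

    // BMHD -> "BMHD" cksize BitMapHeader
    private void bmhd() {
        expectTag("BMHD");
        if (uint32() != 20) fail("BMHD chunk must hold 20 bytes");
        bitMapHeader();
    }

    // BitMapHeader -> thirteen typed fields; the grammar predicts each data type
    private void bitMapHeader() {
        uint16(); uint16();                    // width, height
        int16(); int16();                      // xposition, yposition
        uint8(); uint8(); uint8(); uint8();    // nplanes, masking, compression, reserved
        uint16();                              // transparentcolor
        uint8(); uint8();                      // xaspect, yaspect
        int16(); int16();                      // pagewidth, pageheight
    }

    // BODY -> "BODY" cksize data[cksize]
    private void body() {
        expectTag("BODY");
        long cksize = uint32();
        if (cksize > in.remaining()) fail("BODY size exceeds file length");
        in.position(in.position() + (int) cksize);
        if ((cksize & 1) == 1 && in.hasRemaining()) in.get();   // assumed IFF pad byte
    }

    private void expectTag(String tag) {
        byte[] raw = new byte[4];
        in.get(raw);
        if (!tag.equals(new String(raw, StandardCharsets.US_ASCII))) fail("expected " + tag);
    }

    private long uint32() { return in.getInt() & 0xFFFFFFFFL; }
    private int uint16() { return in.getShort() & 0xFFFF; }
    private short int16() { return in.getShort(); }
    private int uint8() { return in.get() & 0xFF; }
    private static void fail(String message) { throw new IllegalStateException(message); }
}
```

Each private method corresponds to one production, and the chunk-size checks play the role of the semantic rules of the attribute grammar.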
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995, Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Functions are then created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
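The byte-as-character technique can be illustrated outside ANTLR. In the Java sketch below, the file is decoded as ISO-8859-1, which maps each byte value 0x00 through 0xFF to exactly one character, and small functions then assemble big-endian integers from consecutive characters. The class ByteTokens and its method names are hypothetical and are chosen only to echo the data types named above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: treating each input byte as one character token. */
final class ByteTokens {

    /** Decodes the file so that character i corresponds exactly to byte i. */
    static String toCharTokens(byte[] file) {
        return new String(file, StandardCharsets.ISO_8859_1);
    }

    /** Unsigned 8-bit value of the token at the given position. */
    static int uByte(String tokens, int at) {
        return tokens.charAt(at) & 0xFF;
    }

    /** Unsigned 16-bit big-endian value built from two consecutive tokens. */
    static int uWord(String tokens, int at) {
        return (uByte(tokens, at) << 8) | uByte(tokens, at + 1);
    }

    /** Signed 16-bit big-endian value built from two consecutive tokens. */
    static short word(String tokens, int at) {
        return (short) uWord(tokens, at);
    }

    /** Signed 32-bit big-endian value built from four consecutive tokens. */
    static int int32(String tokens, int at) {
        return (uByte(tokens, at) << 24)
             | (uByte(tokens, at + 1) << 16)
             | (uByte(tokens, at + 2) << 8)
             | uByte(tokens, at + 3);
    }
}
```

Choosing a single-byte encoding matters here: with a multi-byte encoding such as UTF-8, some byte sequences would be merged or replaced during decoding and the one-to-one correspondence between bytes and tokens would be lost.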
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and a parser for ILBM files.
grammar iblm;

options {
    language = Java;               // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}

// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      // Print out ILBM chunk size
      { System.out.println("ID: ILBM Size: " + $ckSize.value); }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks*
      body
      { $result = true; }
    ;

// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg
      )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID: BMHD Size: " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      // Print out values in the bitmap structure
      {
        System.out.println("  Width: " + $width.value);
        System.out.println("  Height: " + $height.value);
        System.out.println("  X Position: " + $xPosition.value);
        System.out.println("  Y Position: " + $yPosition.value);
        System.out.println("  Number of Panes: " + $nPanes.value);
        System.out.println("  Masking: " + $masking.value);
        System.out.println("  Compression: " + $compression.value);
        System.out.println("  Reserved1: " + $reserved1.value);
        System.out.println("  TransParent Color: " +
            $transparentColor.value);
        System.out.println("  X Aspest: " + $xAspect.value);
        System.out.println("  Y Aspect: " + $yAspect.value);
        System.out.println("  Page Width: " + $pageWidth.value);
        System.out.println("  Page Height: " + $pageHeight.value);
      }
    ;

cmap returns [long value]
// Initialize the index value i in the cmap rule
@init { int i = 0; }
    : CMAP ckSize
      // Print out chunk size
      {
        $value = $ckSize.value;
        System.out.println("ID: CMAP Size: " + $ckSize.value);
      }
      ( colorRegister
        // Print out color values
        {
          System.out.println("  Color[" + i++ + "] " +
              $colorRegister.redValue + " " +
              $colorRegister.greenValue + " " +
              $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID: CAMG Size: " + $ckSize.value);
        System.out.println("  View Modes: 0x" +
            Long.toHexString($longValue.value));
      }
    ;

// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID: CRNG Size: " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID: CCRT Size: " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID: BODY Size: " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println("  Number of bytes: " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// Defines the data of the body chunk
data returns [long value]
    : (b += BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;

// The cksize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        $value = (long)
            ( getUByteValue($h1) << 24 |
              getUByteValue($h2) << 16 |
              getUByteValue($l1) << 8 |
              getUByteValue($l2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      // Calls getUByteValue on the BYTE token
      { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6. An ANTLR Grammar for the ILBM File Format
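The longValue, uWordValue, and wordValue rules in Figure 6 assemble multi-byte big-endian values from single BYTE tokens. The following is a minimal standalone Java sketch of the same arithmetic, not the authors' code; the class and method names are hypothetical. One point it illustrates, under the assumption that chunk sizes of 2^31 or more can occur: widening the top byte to long before shifting keeps such values positive, whereas shifting an int by 24 and casting afterwards would not.

```java
// Hypothetical helper class; the names below are illustrative and do not appear in the report.
final class BigEndianFields {

    // Same masking step as getUByteValue: keep only the low 8 bits.
    static int uByte(byte b) {
        return b & 0xFF;
    }

    // Unsigned 16-bit word, most significant byte first (range 0..65535).
    static int uWord(byte b1, byte b2) {
        return (uByte(b1) << 8) | uByte(b2);
    }

    // Signed 16-bit word: narrowing to short restores the sign bit.
    static int word(byte b1, byte b2) {
        return (short) uWord(b1, b2);
    }

    // Unsigned 32-bit value. The top byte is widened to long before shifting,
    // so values of 2^31 and above stay positive.
    static long uLong(byte h1, byte h2, byte l1, byte l2) {
        return ((long) uByte(h1) << 24) | (uByte(h2) << 16) | (uByte(l1) << 8) | uByte(l2);
    }

    public static void main(String[] args) {
        // The four bytes 00 00 09 FC decode to 2556.
        System.out.println(uLong((byte) 0x00, (byte) 0x00, (byte) 0x09, (byte) 0xFC));
        System.out.println(word((byte) 0xFF, (byte) 0xFE));   // signed interpretation
        System.out.println(uWord((byte) 0xFF, (byte) 0xFE));  // unsigned interpretation
    }
}
```

The same two bytes give different results under word and uWord, which is why the grammar needs both rules.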
6.2.2 The Lexer and Parser
When provided the grammar file described in the previous section, ANTLR generates the lexer and
parser as Java programs (iblmLexer and iblmParser in Figure 7). The Java application shown in
Figure 7 uses these programs to validate the ILBM file BLK.IFF. It produces the output shown in
Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that each byte becomes one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
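The application in Figure 7 opens the binary file with the ISO-8859-1 encoding. That choice matters because ISO-8859-1 maps each of the 256 byte values to exactly one character, so a string-oriented lexer sees one character per byte and the byte value can be recovered by masking. The sketch below illustrates this round trip in plain Java; the class name is hypothetical and the code is independent of ANTLR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class; not part of the report's code.
final class Latin1RoundTrip {

    // Decodes raw bytes as ISO-8859-1 text, then recovers each byte value
    // the way getUByteValue does: take the character and mask with 0x00ff.
    static int[] byteValuesViaLatin1(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        int[] values = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            values[i] = text.charAt(i) & 0x00ff;
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x7F, (byte) 0x80, (byte) 0xFF};
        for (int v : byteValuesViaLatin1(raw)) {
            System.out.println(v);
        }
    }
}
```

A multi-byte encoding such as UTF-8 would not give this one-to-one correspondence for byte values above 0x7F.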
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the file BLK.IFF
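The sizes reported in Figure 8 can be cross-checked by hand: the FORM size should equal the 4-byte form type plus, for each local chunk, an 8-byte header (ID and size) and the chunk data. The sketch below does that arithmetic. The class and method names are hypothetical, and the even-length padding rule is taken from the EA IFF 85 convention rather than from the grammar in Figure 6.

```java
// Hypothetical consistency check; names are illustrative.
final class IlbmSizeCheck {

    // Bytes a chunk occupies in the file: 4-byte ID + 4-byte size + data,
    // plus one pad byte when the data length is odd (EA IFF 85 convention).
    static long chunkFootprint(long ckSize) {
        return 8 + ckSize + (ckSize & 1);
    }

    // Expected FORM size: the 4-byte form type followed by the local chunks.
    static long formSize(long... localChunkSizes) {
        long total = 4;
        for (long ckSize : localChunkSizes) {
            total += chunkFootprint(ckSize);
        }
        return total;
    }

    public static void main(String[] args) {
        // BMHD = 20, CMAP = 48, CAMG = 4, BODY = 2448, as listed in Figure 8.
        System.out.println(formSize(20, 48, 4, 2448)); // 2556, the reported FORM size
        // CMAP holds 16 colors of 3 bytes each.
        System.out.println(16 * 3);                    // 48
    }
}
```

The computed total, 4 + 28 + 56 + 12 + 2456 = 2556, agrees with the size printed for the ILBM form.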
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}
// Rules used to create the Parser
// start symbol waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue         // Format category
      channels = wordValue          // Number of channels
      samplesPerSec = dWordValue    // Sampling rate
      avgBytesPerSec = dWordValue   // For buffer estimation
      blockAlign = wordValue        // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue     // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ( {size >= 0}?=> BYTE {size--;} )+
    ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        else
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        else
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rule used to create the Lexer
BYTE : . ;
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
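For comparison with the generated parser, the structure that Figure 9 specifies can also be checked by a short hand-written routine. The sketch below follows the rule order of waveform_pcm (RIFF header, format subchunk, data subchunk) using java.nio.ByteBuffer in little-endian mode. It is only an illustration: the class name, the sample() helper, and the requirement that the data subchunk end exactly at the end of the file are assumptions, not part of the grammar.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-written checker; class and method names are illustrative.
final class WaveHeaderCheck {

    // Reads a four-character code such as "RIFF" or "fmt ".
    static String fourCC(ByteBuffer in) {
        byte[] id = new byte[4];
        in.get(id);
        return new String(id, StandardCharsets.ISO_8859_1);
    }

    // Follows the rule order of waveform_pcm: RIFF header, fmt subchunk, data subchunk.
    static boolean isWaveformPcm(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        try {
            if (!fourCC(in).equals("RIFF")) return false;
            in.getInt();       // ckSize (read but not checked, as in the grammar)
            if (!fourCC(in).equals("WAVE")) return false;

            if (!fourCC(in).equals("fmt ")) return false;
            in.getInt();       // ckSize of the format subchunk
            in.getShort();     // formatTag
            in.getShort();     // channels
            in.getInt();       // samplesPerSec
            in.getInt();       // avgBytesPerSec
            in.getShort();     // blockAlign
            in.getShort();     // bitsPerSample

            if (!fourCC(in).equals("data")) return false;
            long dataSize = in.getInt() & 0xFFFFFFFFL;
            // Assumption: the data subchunk is the last thing in the file.
            return dataSize == in.remaining();
        } catch (BufferUnderflowException e) {
            return false;      // the file ended before the structure was complete
        }
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Builds a minimal 8-bit mono file with the given number of sample bytes.
    static byte[] sample(int dataBytes) {
        ByteBuffer out = ByteBuffer.allocate(44 + dataBytes).order(ByteOrder.LITTLE_ENDIAN);
        out.put(ascii("RIFF")).putInt(36 + dataBytes).put(ascii("WAVE"));
        out.put(ascii("fmt ")).putInt(16);
        out.putShort((short) 1).putShort((short) 1);   // formatTag 1 (PCM), one channel
        out.putInt(8000).putInt(8000);                 // samplesPerSec, avgBytesPerSec
        out.putShort((short) 1).putShort((short) 8);   // blockAlign, bitsPerSample
        out.put(ascii("data")).putInt(dataBytes).put(new byte[dataBytes]);
        return out.array();
    }

    public static void main(String[] args) {
        System.out.println(isWaveformPcm(sample(4)));
    }
}
```

Each such checker has to be written and maintained by hand for every format, which is the cost that generating the recognizer from a grammar is meant to avoid.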
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1.0 of the language specification was published in
2011, and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
Nevertheless, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
work of the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
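To make the target of such a generator concrete, the following is a hand-written sketch of a recursive descent recognizer for the generic IFF chunk structure, with one method per grammar rule and the integer data types read directly from the byte array rather than from string tokens. It is an assumption about what generated code might look like, not output from any existing tool; the class, method names, and the toy sample file are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical recognizer written by hand; names and structure are illustrative.
final class IffChunkWalker {

    // Reads a 4-byte chunk or form identifier.
    static String id(ByteBuffer in) {
        byte[] b = new byte[4];
        in.get(b);
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    // Reads an unsigned 32-bit size; ByteBuffer is big-endian by default, as IFF requires.
    static long size(ByteBuffer in) {
        return in.getInt() & 0xFFFFFFFFL;
    }

    // form -> "FORM" ckSize formType chunk*
    static List<String> form(ByteBuffer in) {
        if (!id(in).equals("FORM")) throw new IllegalArgumentException("missing FORM");
        long ckSize = size(in);
        if (ckSize != in.remaining()) throw new IllegalArgumentException("FORM size mismatch");
        List<String> seen = new ArrayList<>();
        seen.add(id(in));                       // form type, for example ILBM
        while (in.hasRemaining()) {
            seen.add(chunk(in));
        }
        return seen;
    }

    // chunk -> ckId ckSize BYTE[ckSize] pad?
    static String chunk(ByteBuffer in) {
        String ckId = id(in);
        long ckSize = size(in);
        long padded = ckSize + (ckSize & 1);    // odd-length data is followed by one pad byte
        if (padded > in.remaining()) throw new IllegalArgumentException(ckId + " overruns the file");
        in.position(in.position() + (int) padded);
        return ckId + ":" + ckSize;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // A toy file with unrealistically small chunks, used only to exercise the walker.
    static ByteBuffer sample() {
        ByteBuffer out = ByteBuffer.allocate(34);
        out.put(ascii("FORM")).putInt(26).put(ascii("ILBM"));
        out.put(ascii("BMHD")).putInt(2).put(new byte[2]);
        out.put(ascii("BODY")).putInt(3).put(new byte[4]);   // 3 data bytes + 1 pad byte
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(form(sample()));
    }
}
```

A format-specific recognizer would replace the generic chunk method with one method per chunk type and add the attribute checks, such as the pad field being zero.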
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
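A signature test for a chunk-based format typically inspects a few bytes at fixed offsets. The sketch below shows the idea for the IFF and RIFF families, where the container tag is at offset 0 and the form type at offset 8. It is a hypothetical illustration in Java, not the magic file described in [Underwood 2009], and the returned labels are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical signature test; it is not the magic file described in [Underwood 2009].
final class ChunkMagic {

    // Classifies a file from its first 12 bytes: container tag, size, form type.
    static String identify(byte[] head) {
        if (head.length < 12) return "unknown";
        String container = new String(head, 0, 4, StandardCharsets.ISO_8859_1);
        String formType = new String(head, 8, 4, StandardCharsets.ISO_8859_1);
        if (container.equals("FORM")) return "IFF " + formType;
        if (container.equals("RIFF")) return "RIFF " + formType;
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] head = {'F', 'O', 'R', 'M', 0, 0, 0x09, (byte) 0xFC, 'I', 'L', 'B', 'M'};
        System.out.println(identify(head));
    }
}
```

Identification of this kind only selects which grammar to apply; it does not establish that the rest of the file conforms to the format.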
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats, and we should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$
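The sample count is the kind of attribute condition a validator would check against the BODY chunk size. As a worked example, the sketch below sums the octaves one by one; the result equals (2^ctOctave - 1) × (oneShotHiSamples + repeatHiSamples). The class and method names are hypothetical, and the sketch assumes uncompressed data with one byte per sample.

```java
// Hypothetical helper; names are illustrative.
final class Svx8BodySize {

    // Total samples over all octaves: octave k (0 = highest frequency) holds
    // 2^k times as many samples as the highest-frequency octave.
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        long highestOctave = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        for (int octave = 0; octave < ctOctave; octave++) {
            total += (1L << octave) * highestOctave;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three octaves with 10 one-shot and 6 repeat samples: (1 + 2 + 4) * 16 = 112.
        System.out.println(bodySamples(3, 10, 6));
    }
}
```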
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks> | <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
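The header chunk rule is simple enough to check directly: the tag "MThd", a 32-bit length that must equal 6, and three 16-bit integers. The sketch below reads these fields with java.nio.ByteBuffer, relying on the big-endian byte order that Standard MIDI Files use. The class name and field layout are hypothetical and only cover the header chunk, not the track chunks.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for the <headerchunk> rule; names are illustrative.
final class MidiHeader {
    final int format;
    final int ntrks;
    final int division;

    MidiHeader(int format, int ntrks, int division) {
        this.format = format;
        this.ntrks = ntrks;
        this.division = division;
    }

    // <headerchunk> -> "MThd" <length of header data> <header data>
    static MidiHeader parse(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);   // big-endian by default
        byte[] id = new byte[4];
        in.get(id);
        if (!new String(id, StandardCharsets.ISO_8859_1).equals("MThd")) {
            throw new IllegalArgumentException("not a header chunk");
        }
        if (in.getInt() != 6) {
            throw new IllegalArgumentException("length of header data must be 6");
        }
        // <header data> -> <format> <ntrks> <division>, each an unsigned 16-bit integer
        int format = in.getShort() & 0xFFFF;
        int ntrks = in.getShort() & 0xFFFF;
        int division = in.getShort() & 0xFFFF;
        return new MidiHeader(format, ntrks, division);
    }

    public static void main(String[] args) {
        byte[] file = {'M', 'T', 'h', 'd', 0, 0, 0, 6, 0, 1, 0, 2, 0, 96};
        MidiHeader h = parse(file);
        System.out.println(h.format + " " + h.ntrks + " " + h.division);
    }
}
```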
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version (AIFC) of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote
                   lowVelocity highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[]
CommentsChunk -> "COMT" cksize numcomments comments[]
Comments -> Comment Comments
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP
typically achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of
image quality. A WebP file consists of VP8 image data and a container based on RIFF. The format
of a WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes, but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last.
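The segment format above can be sketched in Java as follows. This is a minimal illustration rather than a complete JPEG reader: the class name JpegSegments and its list method are hypothetical, and the walk stops at the start-of-scan marker ($da), because the entropy-coded data that follows it is not framed as length-prefixed segments.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the marker segments at the start of a JPEG stream. */
final class JpegSegments {
    /** Returns "type:size" for each segment up to (not including) SOS or EOI. */
    static List<String> list(byte[] file) {
        // Sizes are stored high byte first ("not Intel order").
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN);
        if ((in.getShort() & 0xFFFF) != 0xFFD8) {
            throw new IllegalArgumentException("missing SOI header $ff $d8");
        }
        List<String> segments = new ArrayList<>();
        while (in.remaining() >= 4) {
            int ff = in.get() & 0xFF;
            int type = in.get() & 0xFF;
            if (ff != 0xFF) {
                throw new IllegalArgumentException("segment must start with $ff");
            }
            if (type == 0xD9 || type == 0xDA) {
                break; // EOI trailer, or SOS followed by entropy-coded data
            }
            int size = in.getShort() & 0xFFFF; // counts sh, sl but not $ff and n
            if (size < 2 || size - 2 > in.remaining()) {
                throw new IllegalArgumentException("segment size exceeds file");
            }
            segments.add(String.format("%02X:%d", type, size));
            in.position(in.position() + size - 2); // skip the segment payload
        }
        return segments;
    }
}
```

The size check is the context-sensitive part: the declared size must agree with the bytes actually present, which a plain layout description cannot enforce.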
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
A.11.1 Animator Cel
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
ABSTRACT
The capability to validate and view or play binary file formats, as well as to convert binary
file formats to standard or current file formats, is critically important to the preservation of
digital data and records.
Context-free grammars are used for defining programming languages. There are established
methods for using these grammars to check the syntax of programs and to compile programs into
assembly or machine language programs. Grammars have also been used to define page
description languages and vector graphics languages. However, current technologies for
context-free grammars, syntax checkers, and compilers are limited to string languages or string
data types.
This report describes the extension of context-free grammars from strings to binary files.
Binary files are arrays of data types such as long and short integers, floating-point numbers,
and pointers, as well as characters. The concept of an attribute grammar is extended to these
context-free array grammars. Attribute grammars have been used to define a number of
chunk-based file formats. A parser generator has been used with some of these grammars to
generate syntax checkers (recognizers) for validating binary file formats. The significance of
these results is that, with these extensions to core computer science concepts, traditional
parser/compiler technologies can be used as a part of a cost-effective preservation strategy for
binary file formats.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 BACKGROUND
1.2 PURPOSE
1.3 SCOPE
2 BINARY FILE FORMATS
3 CONTEXT-FREE GRAMMARS AND ATTRIBUTE GRAMMARS
3.1 CONTEXT-FREE GRAMMARS FOR ARRAYS OF DATA TYPES
3.2 ATTRIBUTE GRAMMARS
3.3 SEMANTIC GRAMMAR
4 GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS
4.1 INTERLEAVED BITMAP (ILBM)
4.2 RIFF WAVEFORM PCM AUDIO FILE FORMAT (WAVE)
5 RECURSIVE DESCENT PARSERS FOR BINARY FILE GRAMMARS
6 A PARSER GENERATOR FOR RECOGNIZING BINARY FILE FORMATS
6.1 ANTLR
6.2 USING ANTLR TO GENERATE A PARSER FOR A BINARY FILE FORMAT
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
6.2.2 The Lexer and Parser
6.3 AN ANTLR GRAMMAR FOR THE RIFF WAVEFORM PCM FILE FORMAT
7 RELATED RESEARCH
8 CONCLUSION
REFERENCES
APPENDIX A CHUNK-BASED BINARY FILE FORMATS
A.1 ELECTRONIC ARTS INTERCHANGE FILE FORMAT (IFF)
A.1.1 Interleaved Bitmap (ILBM)
A.1.2 8-bit Sampled Voice (8SVX)
A.1.3 Cel Animation (ANIM)
A.1.4 IFF Formatted Text (FTXT)
A.1.5 IFF Simple Musical Score (SMUS)
A.1.6 Planar Bitmap (PBM)
A.2 STANDARD MIDI FILE FORMAT
A.3 AUDIO INTERCHANGE FILE FORMAT
A.4 RESOURCE INTERCHANGE FILE FORMATS
A.4.1 RIFF-Waveform (WAVE)
A.4.2 WAVEFORMATEX
A.4.3 WAVEFORMATEXTENSIBLE
A.4.4 Broadcast Wave Format
A.4.5 Windows Audio-Video Interleave (AVI)
A.4.6 Animated Windows Cursor (ANI)
A.4.7 RIFF MIDIfile (RMID)
A.4.8 Device-Independent Bitmap (RIFF RDIB)
A.4.9 WebP
A.5 JPEG
A.5.1 JPEG/JFIF
A.5.2 JPEG/Exif
A.5.3 JPEG 2000
A.5.4 JPEG 2000 Codestream
A.6 ADVANCED SYSTEMS FORMAT
A.6.1 Windows Media Audio 7-9
A.6.2 Windows Media Video 7-91
A.7 PORTABLE NETWORK GRAPHICS
A.7.1 Multiple-image Network Graphics
A.7.2 JPEG Network Graphics
A.8 BINARY INTERCHANGE FILE FORMAT
A.9 3D STUDIO
A.9.1 3D Studio File Format - 3ds
A.9.2 3D Studio Material Library Files
A.10 ANIM8OR
A.11 ANIMATOR
A.11.1 Animator Cel
A.11.2 Animator Pro Fli
A.11.3 Animator Pro fLC
A.11.1 Animator Cel
A.11.4 Animator Pro PIC
A.12 AUDIO MANAGER MODULE
A.13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY)
A.14 CALIGARI TRUESPACE SCENEOBJECT (BINARY)
A.15 CORELDRAW VECTOR GRAPHICS FILE
A.16 COREL METAFILE EXCHANGE IMAGE
A.17 MUSIC MODULES
A.17.1 DigiTrekker Module Format
A.17.2 Graoumf Tracker modules
A.17.3 Oktalyzer
A.18 EA MULTIMEDIA GAME FILE FORMATS
A.18.1 Electronic Arts Music (asf)
A.18.2 Electronic Arts CMV
A.18.3 Electronic Arts DCT
A.18.4 Electronic Arts MAD
A.18.5 Electronic Arts MPC
A.18.6 Electronic Arts VP7
A.18.7 Electronic Arts TGV
A.18.8 Electronic Arts TGQ
A.18.9 Electronic Arts TQI
A.19 IMAGINE 3D DATA DESCRIPTION (TDDD)
A.20 LIGHTWAVE 3D OBJECTS
A.20.1 LWOB
A.20.2 LWO2
A.20.3 LWLO
A.21 LUXOLOGY MODO 3D OBJECT (LXOB)
A.22 PAINT SHOP PRO (PSP)
A.23 PROVECTOR 2D DRAWING
A.24 PROWRITE DOCUMENT
A.25 INTERACTIVE FICTION
A.25.1 Blorb Z-machine Resource File
A.25.2 Quetzal
A.26 APPLE QUICKTIME
A.27 MAYA IFF BITMAP
A.28 APPLE CORE AUDIO FORMAT
A.29 AMERICAN LASER GAMES (MM)
A.30 SMUSH ANIMATION FORMAT
A.30.1 SAN
A.30.2 SNM
A.31 STRUCTURED DATA EXCHANGE FORMAT (SDXF)
A.32 OGG VORBIS
A.33 SIMPLE VECTOR ARRAY FORMAT
A.34 NESTED LIST FILE
A.35 CINEMA 4D
A.36 THE SIMS INTERCHANGE FILE FORMAT
List of Figures
Figure 1. File Format (Layout) of a RIFF Container for a WebP Image
Figure 2. ILBM File hampic Displayed
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4. A Parse Tree for the ILBM File hampic
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
Figure 6. An ANTLR Grammar for the ILBM File Format
Figure 7. A Java Application for Parsing an ILBM File
Figure 8. Output of the ILBM Parser Applied to the file BLKIFF
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
1 Introduction
1.1 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives. Automated tools are also needed for
viewing or playing text and binary file formats and for converting legacy and obsolete file
formats to standard or current formats. Technologies such as data description languages have
emerged to address these preservation challenges [Dunckley et al. 2007].
Context-free grammars have been used to specify the syntax of programming languages.
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar and for addressing the mildly context-sensitive features of
programming languages, such as agreement of the data types of variables in expressions with
data type declarations. Such grammars have parsing/translation algorithms that can be used for
syntax checking and for interpretation or translation of the languages the grammars define.
1.2 Purpose
The question addressed by this research is whether the context-free grammars used to specify
the syntax of programming languages can be extended to the specification of binary file
formats, and whether these grammars can be used with parsers to validate the formats of binary
files.
1.3 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats. In section 3, extensions to the concepts of context-free
grammars and attribute grammars are described that enable the specification of binary file
formats. In section 4, examples of attribute array grammars for chunk-based binary file formats
are presented. In section 5, recursive descent parsers for these classes of grammars are
described. In section 6, experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files is discussed. In
section 7, related research is described. Section 8 summarizes the results.
2 Binary File Formats
Traditionally, a file (or record) layout was a form showing how fields were positioned in a file
(or record). The fields are named and have as attributes a data type, a length, and sometimes a
constant value. Fields are located at offsets relative to the beginning of the file. File
formats are still specified in this manner. For instance, Figure 1 shows the layout of a RIFF
container for the VP8 codec of a WebP image [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1. File Format (Layout) of a RIFF Container for a WebP Image
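A layout of this kind can be checked directly against a file by reading each field at its stated offset. The Java sketch below does this for Figure 1; the class and method names are hypothetical, and it assumes the little-endian byte order that RIFF uses for sizes. The two size comparisons at the end are deliberately loose upper bounds, since the table itself does not say how padding is handled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical checker for the fixed-offset layout of Figure 1. */
final class WebPLayout {
    /** Reads the 4-byte tag stored at the given file offset. */
    static String tagAt(ByteBuffer in, int offset) {
        byte[] tag = new byte[4];
        for (int i = 0; i < 4; i++) {
            tag[i] = in.get(offset + i);
        }
        return new String(tag, StandardCharsets.US_ASCII);
    }

    /** Returns the declared VP8 data size, or throws if a field disagrees with the layout. */
    static long check(byte[] file) {
        if (file.length < 20) {
            throw new IllegalArgumentException("shorter than the 20-byte header");
        }
        // Assumption: RIFF sizes are little-endian.
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        expect(tagAt(in, 0), "RIFF");
        long riffSize = in.getInt(4) & 0xFFFFFFFFL;
        expect(tagAt(in, 8), "WEBP");
        expect(tagAt(in, 12), "VP8 ");
        long vp8Size = in.getInt(16) & 0xFFFFFFFFL;
        // Context-sensitive checks: declared sizes must fit the bytes that follow.
        if (riffSize > file.length - 8L) {
            throw new IllegalArgumentException("RIFF size exceeds file length");
        }
        if (vp8Size > file.length - 20L) {
            throw new IllegalArgumentException("VP8 size exceeds file length");
        }
        return vp8Size;
    }

    private static void expect(String actual, String wanted) {
        if (!actual.equals(wanted)) {
            throw new IllegalArgumentException("expected tag " + wanted);
        }
    }
}
```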
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human-readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not actually expressible in the EBNF (or regular
expression) notation. These details are often context-sensitive relationships, such as the
relationship of the declared size of an array to its actual length, or of an address pointer to
the actual location in the file of the data pointed to. These context-sensitive relationships
are not expressible in a context-free grammar (or EBNF notation). It is a goal of this research
to extend the EBNF notation and the concept of a context-free grammar to include the details
that are necessary to precisely specify binary file formats that include these
context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats.
There are two families of binary file formats that are readily distinguished: the chunk-based
and the directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. An IFF file itself is one entire IFF chunk. A
chunk consists of an ID tag, a size, and size bytes of data. The data may include chunks, called
subchunks. Subchunks have the same structure as chunks, and subchunks can themselves have
subchunks. Chunk-based file formats are primarily used as containers for multimedia, but have
also been used for word processing files that include pictures and other figures. File format
specifications for chunk-based formats may use terms other than chunks in describing the
format, for instance "atoms" in QuickTime/MP4, "segments" in JPEG, and "tagged data"
representations as in AutoCAD DXF.
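The chunk structure just described (ID tag, size, size bytes of data, possibly containing subchunks) can be traversed with a few lines of Java. The sketch below is illustrative only; the class name IffChunks is hypothetical. It assumes the EA IFF 85 conventions that sizes are big-endian, that an odd-sized chunk is followed by one pad byte, and that a FORM chunk's data begins with a 4-byte form type followed by its subchunks. Other group chunk types are not descended into.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical walker that lists the chunks of an IFF file, indented by nesting depth. */
final class IffChunks {
    static List<String> list(byte[] file) {
        List<String> out = new ArrayList<>();
        // ByteBuffer reads big-endian by default, which matches IFF.
        walk(ByteBuffer.wrap(file), 0, file.length, "", out);
        return out;
    }

    private static void walk(ByteBuffer buf, long start, long end, String indent, List<String> out) {
        long pos = start;
        while (pos + 8 <= end) {
            byte[] id = new byte[4];
            for (int i = 0; i < 4; i++) {
                id[i] = buf.get((int) pos + i);
            }
            String tag = new String(id, StandardCharsets.US_ASCII);
            long size = buf.getInt((int) pos + 4) & 0xFFFFFFFFL;
            long dataEnd = pos + 8 + size;
            if (dataEnd > end) {
                throw new IllegalArgumentException(tag + ": size exceeds enclosing chunk");
            }
            out.add(indent + tag + " " + size);
            if (tag.equals("FORM")) {
                // Skip the 4-byte form type, then visit the subchunks.
                walk(buf, pos + 12, dataEnd, indent + "  ", out);
            }
            pos = dataEnd + (size & 1); // assumed pad byte after odd-sized data
        }
    }
}
```

Because subchunks have the same structure as chunks, one recursive procedure suffices; the enclosing chunk's end address is passed down so that a subchunk cannot claim bytes beyond its parent.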
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format
consists of one or more directory tables, which contain one or more directory entries. A
directory entry specifies where the actual data for a type of information is located. This is
the scheme used in TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS
OpenDocument files, and Microsoft Open Office files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar G is a quadruple <N, Σ, S, P>, where
N is a finite set of non-terminal symbols,
Σ is a finite set of terminal symbols,
S ∈ N is the start symbol, and
P is a set of production rules of the form N → (N ∪ Σ)*.
The set of all strings over a vocabulary Σ is denoted by Σ*. The language generated by a
context-free grammar G is the set L(G) = {w | w ∈ Σ* and S ⇒* w}.
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index
i has the address 4 × i.
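The address computation in this example can be written out in Java as follows. The class name ArrayAddressing is hypothetical, and a ByteBuffer stands in for the file.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of index-to-address computation for 4-byte elements. */
final class ArrayAddressing {
    /** Reads element i of an array of 4-byte integers stored from address 0. */
    static int element(ByteBuffer file, int i) {
        return file.getInt(4 * i); // absolute read at address 4 × i
    }

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(40); // 10 words at addresses 0, 4, ..., 36
        for (int i = 0; i < 10; i++) {
            file.putInt(4 * i, i * i);
        }
        System.out.println(element(file, 9)); // the word stored at address 36
    }
}
```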
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and the
files on those devices) are one-dimensional arrays of data types whose indices are their
addresses.
A data type is a classification of data, such as floating-point, integer, character, pointer,
or Boolean, that determines the possible values for that type, the operations that can be
performed on values of that type, and the way values of that type can be stored. Data types
have names such as int16, int32, float, char, bool, and ptr. We define a binary file to be an
array of values of data of various types.
We define a context-free binary file grammar BG as a quintuple <N, D, Σ, S, P>, where
N is a finite set of non-terminal symbols,
D is a set of data types,
Σ is a finite set of binary values of data types D, called terminals,
S ∈ N is the start symbol, and
P is a set of production rules of the form N → (N ∪ Σ)*.
Let DataTypes indicate the union of all the values of all data types D. The set of all binary
files is BinaryFiles = [Indices → DataTypes], where Indices = {1, ..., n}. The language
generated by a context-free array (binary file) grammar BG is the set
L(BG) = {w | w ∈ BinaryFiles and S ⇒* w}.
By a primitive nonterminal is meant a symbol that would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is
a nonterminal that appears on the left-hand side of productions that have only terminal symbols
on the right. We adopt the notation shown below for primitive nonterminals:
<primitive_nonterminal_symbol type=datatype_name constant=value>
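One way to picture a primitive nonterminal is as a small object that knows its data type and optional constant and can match itself against the bytes at the current file position. The Java sketch below is a hypothetical illustration, not part of the grammars in this report: the class Primitive, its DataType enumeration, and the big-endian byte order (ByteBuffer's default) are all assumptions.

```java
import java.nio.ByteBuffer;

/** Hypothetical model of <name type=datatype_name constant=value>. */
final class Primitive {
    enum DataType {
        UINT8(1), INT16(2), UINT16(2), INT32(4), UINT32(4);

        final int bytes;

        DataType(int bytes) {
            this.bytes = bytes;
        }
    }

    final String name;
    final DataType type;
    final Long constant; // null when the rule has no constant=value

    Primitive(String name, DataType type, Long constant) {
        this.name = name;
        this.type = type;
        this.constant = constant;
    }

    /** Consumes one value of this type; throws if a required constant does not match. */
    long match(ByteBuffer in) {
        if (in.remaining() < type.bytes) {
            throw new IllegalStateException(name + ": unexpected end of file");
        }
        long value;
        switch (type) {
            case UINT8:  value = in.get() & 0xFFL; break;
            case INT16:  value = in.getShort(); break;
            case UINT16: value = in.getShort() & 0xFFFFL; break;
            case INT32:  value = in.getInt(); break;
            default:     value = in.getInt() & 0xFFFFFFFFL; break; // UINT32
        }
        if (constant != null && constant.longValue() != value) {
            throw new IllegalStateException(name + ": expected " + constant + ", found " + value);
        }
        return value;
    }
}
```

The data type determines how many bytes are consumed, so no separate tokenizing step is needed; the constant, when present, plays the role of a terminal.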
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used or
(2) checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and
other natural languages, but they cannot define such context-sensitivity as subject-verb
agreement in English sentences. A number of extensions of context-free grammars have been
proposed to address the context-sensitivity of programming languages and natural languages.
These include indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van
Wijngaarden 1974], and attribute grammars [Knuth 1969, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar GA is a triple <G, A, AR>, where G is a context-free grammar for the
language, A associates each grammar symbol X ∈ (N ∪ Σ) with a set of attributes, and AR
associates each production R ∈ P with a set of attribute computation rules. A(X), where
X ∈ (N ∪ Σ), can be further partitioned into two sets: synthesized attributes S(X) and
inherited attributes I(X). AR(R), where R ∈ P, contains rules for computing inherited and
synthesized attributes associated with the symbols in the production R. Synthesized attributes
are evaluated and passed up from deeper terminals and non-terminals, while inherited attributes
are passed down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes, and
ii) inherited attributes, where each inherited attribute depends only on
a) attributes of the other child nodes to its left, and
b) inherited attributes of the parent node itself.
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of
the abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
3.3 Semantic Grammar
An interpreter for a binary file format that is using a grammar to display or play the contents
of a file, or to convert it to another file format, needs to produce a semantic representation
that is appropriate to the task. Natural language understanding components of dialogue systems
face a similar need to produce a semantic representation that is appropriate to the dialogue
task. To address this need, some developers of dialogue systems make use of an approach called
semantic grammar [Issar and Ward 1993; Ward and Issar 1994]. A semantic grammar is a CFG in
which the non-terminals on the left-hand side of rules correspond to semantic entities. A
parser for a semantic grammar produces a labeling of the input string with semantic node
labels, for example:
SHOW     FLIGHTS  ORIGIN       DESTINATION       DEPART_DATE  DEPART_TIME
Show me  flights  from Boston  to San Francisco  on Tuesday   morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems, but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar
for a binary file format is not meant to apply to all binary file formats, but just to a single
format or to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules
(attribute values of the left side of a production are calculated from attribute values of
symbols on the right side) and inherited rules (attribute values of symbols on the right side
of a production are calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the
grammar. Constant values in fields are terminals that appear on the right-hand side of
productions. Nonterminal field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the
file format could be specified using pseudo-EBNF rules and C data structure declarations
[Morrison 1986]. In this section, the "grammar" of the ILBM file format specification is
converted to a context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a mixture of synthesized semantic
rules (attribute values of the left side of a production are calculated from attribute values
of symbols on the right side) and inherited rules (attribute values of symbols on the right
side of a production are calculated from attribute values of the symbol on the left side). The
attribute grammar is also an L-attributed grammar. Figure 4 shows a parse tree for the file
hampic.
Figure 2. ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
           ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]      n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4. A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' <data-ck> | <silence-ck>
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or
non-recursive) procedures, where each procedure implements one of the production rules of the
grammar. A predictive parser is a recursive descent parser that does not require backtracking.
Predictive parsing is possible only for the class of LL(k) grammars, which are the context-free
grammars for which there exists some positive integer k that allows a recursive descent parser
to decide which production to use by examining only the next k tokens of input. A recursive
descent parser for the binary file format grammars defined in this paper does not require a
separate lexer to identify the data of a file. The data types are predicted by the grammar
rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser
to check context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that
occur in the file formats are handled by pushing the current address in the file onto a
pushdown stack. When the parser exhausts the procedures implementing the production rules
corresponding to the pointed-to location, the topmost address on the pushdown stack is popped
and the procedure corresponding to the nonterminal at that position is executed.
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that
reads a file and generates an error if the file does not conform to the syntax specified by the
grammar.
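As an illustration of such a recognizer, the following Java sketch hand-codes one procedure per production for a reduced form of the ILBM grammar (FORM, BMHD, CMAP, and BODY only, in that fixed order). It is a simplified sketch under stated assumptions, not the parser described in this report: the class name IlbmRecognizer is hypothetical, the CAMG chunk and the pointer stack are omitted, and it assumes big-endian sizes, a 20-byte BMHD, and a pad byte after odd-sized chunk data.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical recognizer for a reduced grammar: FORM cksize ILBM BMHD CMAP BODY. */
final class IlbmRecognizer {
    private final ByteBuffer in; // big-endian by default

    private IlbmRecognizer(byte[] file) {
        in = ByteBuffer.wrap(file);
    }

    /** Returns true if the file conforms to the reduced grammar. */
    static boolean recognize(byte[] file) {
        try {
            new IlbmRecognizer(file).ilbm();
            return true;
        } catch (IllegalStateException | BufferUnderflowException e) {
            return false;
        }
    }

    // <ILBM> -> "FORM" <cksize> "ILBM" <BMHD> <CMAP> <BODY>
    private void ilbm() {
        literal("FORM");
        long cksize = uint32();
        long end = in.position() + cksize; // attribute checked after the subchunks
        literal("ILBM");
        bmhd();
        cmap();
        body();
        if (in.position() != end) fail("FORM cksize does not match its contents");
    }

    // <BMHD> -> "BMHD" <cksize> <BitMapHeader>
    private void bmhd() {
        literal("BMHD");
        if (uint32() != 20) fail("BMHD cksize must be 20");
        skip(20); // the header fields are not inspected in this sketch
    }

    // <CMAP> -> "CMAP" <cksize> <color>[n], n = cksize / 3
    private void cmap() {
        literal("CMAP");
        long cksize = uint32();
        if (cksize % 3 != 0) fail("CMAP cksize is not a multiple of 3");
        for (long i = 0; i < cksize / 3; i++) {
            uint8(); // red
            uint8(); // green
            uint8(); // blue
        }
        pad(cksize);
    }

    // <BODY> -> "BODY" <cksize> <data>[cksize]
    private void body() {
        literal("BODY");
        long cksize = uint32();
        skip(cksize);
        pad(cksize);
    }

    private void literal(String tag) {
        byte[] b = new byte[4];
        in.get(b);
        if (!tag.equals(new String(b, StandardCharsets.US_ASCII))) fail("expected " + tag);
    }

    private long uint32() {
        return in.getInt() & 0xFFFFFFFFL;
    }

    private int uint8() {
        return in.get() & 0xFF;
    }

    private void skip(long n) {
        if (n > in.remaining()) fail("chunk data runs past end of file");
        in.position(in.position() + (int) n);
    }

    private void pad(long cksize) {
        if ((cksize & 1) == 1) skip(1);
    }

    private static void fail(String message) {
        throw new IllegalStateException(message);
    }
}
```

Each procedure predicts the data type of the next field from the production it implements, so no lexer is involved, and the cksize values act as attributes that drive how much data the following procedures consume.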
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a
particular binary file format and whose generated output is the Java source code of a recursive
descent parser (with a pushdown stack if needed) for the class of binary file formats specified
by the grammar. Such a parser generator does not yet exist, but there is a widely used parser
generator for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file
formats, because the data types predicted by the parser are the tokens of the grammar.
Furthermore, the capability to recognize LL(*) grammars is not needed, because the binary file
format grammars specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer
that treats each input byte in a file as a character token. Then functions are created for each
data type in the binary file grammar that convert the appropriate number of character tokens to
binary data types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has
allowed us to test our attribute grammars for binary file formats as well as to demonstrate the
feasibility of creating a parser generator for binary file grammars.
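The conversion functions mentioned above might look like the following Java sketch. The names are hypothetical and the code does not depend on ANTLR: each byte token is represented simply as the one character it carries. It assumes the input bytes were decoded one-to-one into characters (for example with ISO-8859-1) and that multi-byte values are stored high byte first, as in IFF.

```java
/** Hypothetical conversions from one-character byte tokens to binary data types. */
final class ByteTokens {
    /** Keeps the low 8 bits of a one-character token. */
    static int uByte(char c) {
        return c & 0x00ff;
    }

    /** Two byte tokens, high byte first, as an unsigned 16-bit value. */
    static int uWord(char hi, char lo) {
        return (uByte(hi) << 8) | uByte(lo);
    }

    /** Two byte tokens, high byte first, as a signed 16-bit value. */
    static int word(char hi, char lo) {
        return (short) uWord(hi, lo);
    }

    /** Four byte tokens, high byte first, as an unsigned 32-bit value held in a long. */
    static long uLong(char b0, char b1, char b2, char b3) {
        return ((long) uWord(b0, b1) << 16) | uWord(b2, b3);
    }
}
```

For a little-endian family such as RIFF, the same functions would take their arguments in the opposite order.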
621 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
  language = Java;  // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }
}
// Grammar rules used to create the Parser
// Start symbol: iblm
iblm returns [boolean result]
  : FORM ckSize ILBM
    {
      // Print out ILBM chunk size
      System.out.println("ID ILBM Size " + $ckSize.value);
    }
    // Semantic predicate that ensures that at least one BMHD is found
    propertyChunks { $propertyChunks.foundBMHD == true }?
    dataChunks
    body
    { $result = true; }
  ;
// The bmhd, cmap, and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
  : ( bmhd { $foundBMHD = true; }
    | cmap
    | camg
    )+
  ;
bmhd
  : BMHD ckSize
    { System.out.println("ID BMHD Size " + $ckSize.value); }
    bitMapHeader
  ;
bitMapHeader
  : width = uWordValue
    height = uWordValue
    xPosition = wordValue
    yPosition = wordValue
    nPanes = uByteValue
    masking = uByteValue
    compression = uByteValue
    reserved1 = uByteValue
    transparentColor = uWordValue
    xAspect = uByteValue
    yAspect = uByteValue
    pageWidth = wordValue
    pageHeight = wordValue
    {
      // Print out values in the bitmap structure
      System.out.println("Width " + $width.value);
      System.out.println("Height " + $height.value);
      System.out.println("X Position " + $xPosition.value);
      System.out.println("Y Position " + $yPosition.value);
      System.out.println("Number of Panes " + $nPanes.value);
      System.out.println("Masking" + $masking.value);
      System.out.println("Compression" + $compression.value);
      System.out.println("Reserved1" + $reserved1.value);
      System.out.println("TransParent Color" +
          $transparentColor.value);
      System.out.println("X Aspest" + $xAspect.value);
      System.out.println("Y Aspect" + $yAspect.value);
      System.out.println("Page Width" + $pageWidth.value);
      System.out.println("Page Height" + $pageHeight.value);
    }
  ;
cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
  : CMAP ckSize
    {
      // Print out chunk size
      $value = $ckSize.value;
      System.out.println("ID CMAP Size " + $ckSize.value);
    }
    ( colorRegister
      {
        // Print out color values
        System.out.println("Color[" + i++ + "] " +
            $colorRegister.redValue +
            " " + $colorRegister.greenValue +
            " " + $colorRegister.blueValue);
      }
    )+
  ;
colorRegister returns [int redValue, int greenValue, int blueValue]
  : red = uByteValue green = uByteValue blue = uByteValue
    {
      $redValue = $red.value;
      $greenValue = $green.value;
      $blueValue = $blue.value;
    }
  ;
camg
  : CAMG ckSize longValue
    {
      System.out.println("ID CAMG Size " + $ckSize.value);
      System.out.println("View Modes 0x" +
          Long.toHexString($longValue.value));
    }
  ;
// The crng and ccrt data chunks
dataChunks
  : crng
  | ccrt
  ;
crng
  : CRNG ckSize
    { System.out.println("ID CRNG Size " + $ckSize.value); }
    cRange
  ;
cRange
  : pad1 = wordValue { $pad1.value == 0 }?
    rate = wordValue
    active = wordValue
    low = uByteValue
    high = uByteValue
  ;
ccrt
  : CCRT ckSize
    { System.out.println("ID CCRT Size " + $ckSize.value); }
    cycleInfo
  ;
cycleInfo
  : direction = wordValue
    start = uByteValue
    end = uByteValue
    seconds = longValue
    microseconds = longValue
    pad = wordValue { $pad.value == 0 }?
  ;
body
  : BODY ckSize
    { System.out.println("ID BODY Size " + $ckSize.value); }
    { $ckSize.value > 0 }? data
    { System.out.println("Number of bytes " + $data.value); }
    { $data.value == $ckSize.value }?
  ;
// defines the data chunk of the body chunk
data returns [long value]
  : (b+=BYTE)+ { $b.size() > 0 }?
    { $value = $b.size(); }
  ;
// The ckSize is the value of a long integer
ckSize returns [long value]
  : longValue
    { $value = $longValue.value; }
  ;
// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
  : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
    {
      $value = (long)
          ( getUByteValue($h1) << 24 |
            getUByteValue($h2) << 16 |
            getUByteValue($l1) << 8 |
            getUByteValue($l2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the unsigned word data type
uWordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (int)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (short)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the unsigned byte data type
uByteValue returns [int value]
  : BYTE
    {
      // calls getUByteValue on the BYTE token
      $value = (int) getUByteValue($BYTE);
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : '\u0000'..'\u00FF' ;
Figure 6. An ANTLR Grammar for the ILBM File Format
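The bitMapHeader rule in Figure 6 amounts to a fixed sequence of typed reads, which illustrates why the data types themselves can serve as the tokens of a binary file grammar. The following minimal sketch reads the same thirteen fields directly with java.io.DataInputStream, which reads big-endian values. The class BitMapHeader and its parse method are hypothetical; the field names and types follow Figure 6.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical value class holding the fields of the bitMapHeader rule. */
final class BitMapHeader {
    int width, height;                            // uWordValue
    int xPosition, yPosition;                     // wordValue (signed)
    int nPanes, masking, compression, reserved1;  // uByteValue
    int transparentColor;                         // uWordValue
    int xAspect, yAspect;                         // uByteValue
    int pageWidth, pageHeight;                    // wordValue (signed)

    /** Reads the 20 bytes of BMHD chunk data in the order given in Figure 6. */
    static BitMapHeader parse(byte[] ckData) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(ckData))) {
            BitMapHeader h = new BitMapHeader();
            h.width = in.readUnsignedShort();
            h.height = in.readUnsignedShort();
            h.xPosition = in.readShort();
            h.yPosition = in.readShort();
            h.nPanes = in.readUnsignedByte();
            h.masking = in.readUnsignedByte();
            h.compression = in.readUnsignedByte();
            h.reserved1 = in.readUnsignedByte();
            h.transparentColor = in.readUnsignedShort();
            h.xAspect = in.readUnsignedByte();
            h.yAspect = in.readUnsignedByte();
            h.pageWidth = in.readShort();
            h.pageHeight = in.readShort();
            return h;
        } catch (IOException e) {
            // Includes EOFException when fewer than 20 bytes are supplied.
            throw new UncheckedIOException(e);
        }
    }
}
```

The field sizes add up to 20 bytes, which agrees with the BMHD chunk size reported in Figure 8.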
6.2.2 The Lexer and Parser
When provided with the grammar described in the previous section, ANTLR generates the iblmLexer
and iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
  static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

  public static void main(String[] args) throws RecognitionException, IOException {
    // ISO-8859-1 maps every input byte to exactly one character
    ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
    iblmLexer lexer = new iblmLexer(stream);
    TokenStream tokenStream = new CommonTokenStream(lexer);
    iblmParser parser = new iblmParser(tokenStream);
    System.out.println("FILE " + filePath);
    System.out.println("Successfully parsed = " + parser.iblm());
  }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
  language = Java;  // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }

  // Default ByteOrder to Little Endian
  String byteOrder = "II";
}
// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
  : ckId = fourCC { $ckId.value.equals("RIFF") }?
    ckSize = dWordValue
    formType = fourCC { $formType.value.equals("WAVE") }?
    formatSubChunk
    dataSubChunk
    { $result = true; }
  ;
fourCC returns [String value]
  : BYTE BYTE BYTE BYTE { $value = $text; }
  ;
formatSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
    ckSize = dWordValue
    commonFields
    pcmSpecificField
  ;
commonFields
  : formatTag = wordValue        // Format category
    channels = wordValue         // Number of channels
    samplesPerSec = dWordValue   // Sampling rate
    avgBytesPerSec = dWordValue  // For buffer estimation
    blockAlign = wordValue       // Data block size
  ;
pcmSpecificField
  : bitsPerSample = wordValue    // Sample size
  ;
dataSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("data") }?
    ckSize = dWordValue
    waveData[$ckSize.value]
  ;
waveData [long size]
  : ( { size >= 0 }?=> BYTE { size--; } )+
  ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
  : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (long)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 |
              getUByteValue($b3) << 16 |
              getUByteValue($b4) << 24 );
      } else {
        $value = (long)
            ( getUByteValue($b1) << 24 |
              getUByteValue($b2) << 16 |
              getUByteValue($b3) << 8 |
              getUByteValue($b4) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (int)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 );
      } else {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// Terminal rule used to create the Lexer
BYTE : '\u0000'..'\u00FF' ;
Figure 9. An ANTLR Grammar for the RIFF WAVEFORM PCM File Format
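The byte-order switch in the dWordValue rule of Figure 9 can be illustrated in isolation. The sketch below is a hypothetical stand-alone helper (the class DWordDecoder is not part of the grammar) that decodes the same four bytes under both orders. As in Figure 9, "II" selects little-endian order and any other value selects big-endian order; "MM" is used here only as an example of such a value.

```java
/** Hypothetical helper mirroring the byteOrder switch of the dWordValue rule. */
final class DWordDecoder {

    /**
     * Decodes the four bytes starting at off as an unsigned 32-bit value.
     * "II" selects little-endian order; any other value selects big-endian order.
     */
    static long dWord(byte[] b, int off, String byteOrder) {
        long b1 = b[off] & 0xFF;
        long b2 = b[off + 1] & 0xFF;
        long b3 = b[off + 2] & 0xFF;
        long b4 = b[off + 3] & 0xFF;
        if (byteOrder.equals("II")) {
            return b1 | (b2 << 8) | (b3 << 16) | (b4 << 24);
        }
        return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    public static void main(String[] args) {
        byte[] rate = {0x44, (byte) 0xAC, 0x00, 0x00};
        System.out.println(dWord(rate, 0, "II")); // 44100
        System.out.println(dWord(rate, 0, "MM")); // 1152122880
    }
}
```

In this sketch each byte is widened to long before shifting, so a most significant byte of 0x80 or more yields a large positive value rather than a negative one.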
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those
of C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, to our knowledge, the binary file grammar described in this paper is the only data
description language based on formal grammars that is used for creating recognizers for file
formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been constructed for the
chunk-based file formats AIFF, JPEG, and WAVE, as well as the directory-based format TIFF.
However, the research reported herein differs from that of the JHOVE project in that what is
sought is a technology for generating validators for binary file formats from grammars
specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with a pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and
does not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
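To make the idea of a signature test concrete, the following is a minimal sketch in Java of a magic test for three of the chunk-based formats discussed in this report. It is not the magic file described in [Underwood 2009]; the class ChunkMagic, its method, and the returned labels are hypothetical. It relies only on the layout used by the grammars above: a four-byte container ID at offset 0 and a four-byte form type at offset 8.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical signature test for a few chunk-based formats. */
final class ChunkMagic {

    /** Returns a format label based on bytes 0-3 and 8-11 of the file, or "unknown". */
    static String identify(byte[] head) {
        if (head.length < 12) {
            return "unknown"; // too short to hold container ID, size, and form type
        }
        String container = new String(head, 0, 4, StandardCharsets.ISO_8859_1);
        String formType = new String(head, 8, 4, StandardCharsets.ISO_8859_1);
        if (container.equals("FORM") && formType.equals("ILBM")) return "IFF ILBM";
        if (container.equals("FORM") && formType.equals("8SVX")) return "IFF 8SVX";
        if (container.equals("RIFF") && formType.equals("WAVE")) return "RIFF WAVE";
        return "unknown";
    }
}
```

A signature test of this kind only selects which grammar to apply; it says nothing about whether the rest of the file conforms to the format, which is the job of the generated parser.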
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The parser generator that we have found that comes closest to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with
these formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can
be used for conversion of legacy file formats to current or standard formats. Finally, the same
parser generator can be used for generating viewers/players for most file formats. This would
increase the likelihood of preserving and making available into the indefinite future those
digital records encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming Language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology "Appendix H LXO file format" Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
33
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
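The octave arithmetic described above can be checked numerically. The following is a minimal sketch in Python; the function names are hypothetical and are not part of the 8SVX specification, and the argument names simply mirror the header fields oneShotHiSamples, repeatHiSamples and ctOctave.

```python
def octave_sample_counts(one_shot_hi: int, repeat_hi: int, ct_octave: int) -> list:
    """Samples per octave, highest-frequency octave first.

    Octave k (k = 0 is the highest) holds 2**k times the samples
    of the highest octave.
    """
    highest = one_shot_hi + repeat_hi
    return [(2 ** k) * highest for k in range(ct_octave)]


def body_sample_count(one_shot_hi: int, repeat_hi: int, ct_octave: int) -> int:
    """Total samples in BODY: (2^0 + ... + 2^(ctOctave-1)) * (oneShot + repeat)."""
    return sum(octave_sample_counts(one_shot_hi, repeat_hi, ct_octave))


if __name__ == "__main__":
    # Three octaves with 100 one-shot and 28 repeat samples in the highest octave.
    print(octave_sample_counts(100, 28, 3))  # [128, 256, 512]
    print(body_sample_count(100, 28, 3))     # 896
```

A validator could compare this total with the cksize of the BODY chunk, assuming uncompressed one-byte samples.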
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
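As an illustration of the header chunk rule, the sketch below reads the "MThd" tag, checks that the length of header data is 6, and unpacks the three 16-bit integers. It is a minimal Python sketch, not the report's parser; the function name parse_midi_header is hypothetical, and big-endian byte order is assumed, as in the Standard MIDI File specification.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Recognize <headerchunk> -> "MThd" <length> <format> <ntrks> <division>."""
    if len(data) < 14:
        raise ValueError("file too short for a MIDI header chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data must be 6")
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}


if __name__ == "__main__":
    sample = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
    print(parse_midi_header(sample))
```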
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange
File Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes, but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last.
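The segment layout above can be walked with a few lines of code. The following Python sketch follows only the simplified layout given here (SOI, length-prefixed segments, EOI); it is an assumption-laden illustration, not a full JPEG reader, since real files also carry entropy-coded scan data and standalone markers that this sketch does not handle. The name iter_jpeg_segments is hypothetical.

```python
import struct

SOI = b"\xff\xd8"  # two-byte header identifying a JPEG/JFIF file
EOI = 0xD9         # type byte of the two-byte trailer


def iter_jpeg_segments(data: bytes):
    """Yield (segment_type, payload) for each segment between SOI and EOI."""
    if data[:2] != SOI:
        raise ValueError("missing SOI header ($ff $d8)")
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type == EOI:
            return
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # Size is big-endian ("not Intel order") and counts its own two bytes.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError("segment size is inconsistent with the file length")
        yield seg_type, data[pos + 4:pos + 2 + size]
        pos += 2 + size
    raise ValueError("missing EOI trailer ($ff $d9)")
```

Because the size field excludes the $ff and type bytes, the next segment starts 2 + size bytes after the current one, which is the only arithmetic the walker needs.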
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000], Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
TABLE OF CONTENTS
1 INTRODUCTION
1.1 BACKGROUND
1.2 PURPOSE
1.3 SCOPE
2 BINARY FILE FORMATS
3 CONTEXT-FREE GRAMMARS AND ATTRIBUTE GRAMMARS
3.1 CONTEXT-FREE GRAMMARS FOR ARRAYS OF DATA TYPES
3.2 ATTRIBUTE GRAMMARS
3.3 SEMANTIC GRAMMAR
4 GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS
4.1 INTERLEAVED BITMAP (ILBM)
4.2 RIFF WAVEFORM PCM AUDIO FILE FORMAT (WAVE)
5 RECURSIVE DESCENT PARSERS FOR BINARY FILE GRAMMARS
6 A PARSER GENERATOR FOR RECOGNIZING BINARY FILE FORMATS
6.1 ANTLR
6.2 USING ANTLR TO GENERATE A PARSER FOR A BINARY FILE FORMAT
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
6.2.2 The Lexer and Parser
6.3 AN ANTLR GRAMMAR FOR THE RIFF WAVEFORM PCM FILE FORMAT
7 RELATED RESEARCH
8 CONCLUSION
REFERENCES
APPENDIX A CHUNK-BASED BINARY FILE FORMATS
A.1 ELECTRONIC ARTS INTERCHANGE FILE FORMAT (IFF)
A.1.1 Interleaved Bitmap (ILBM)
A.1.2 8-bit Sampled Voice (8SVX)
A.1.3 Cel Animation (ANIM)
A.1.4 IFF Formatted Text (FTXT)
A.1.5 IFF Simple Musical Score (SMUS)
A.1.6 Planar Bitmap (PBM)
A.2 STANDARD MIDI FILE FORMAT
A.3 AUDIO INTERCHANGE FILE FORMAT
A.4 RESOURCE INTERCHANGE FILE FORMATS
A.4.1 RIFF-Waveform (WAVE)
A.4.2 WAVEFORMATEX
A.4.3 WAVEFORMATEXTENSIBLE
A.4.4 Broadcast Wave Format
A.4.5 Windows Audio-Video Interleave (AVI)
A.4.6 Animated Windows Cursor (ANI)
A.4.7 RIFF MIDIfile (RMID)
A.4.8 Device-Independent Bitmap (RIFF RDIB)
A.4.9 WebP
A.5 JPEG
A.5.1 JPEG/JFIF
A.5.2 JPEG/Exif
A.5.3 JPEG 2000
A.5.4 JPEG 2000 Codestream
A.6 ADVANCED SYSTEMS FORMAT
A.6.1 Windows Media Audio 7-9
A.6.2 Windows Media Video 7-91
A.7 PORTABLE NETWORK GRAPHICS
A.7.1 Multiple-image Network Graphics
A.7.2 JPEG Network Graphics
A.8 BINARY INTERCHANGE FILE FORMAT
A.9 3D STUDIO
A.9.1 3D Studio File Format – 3ds
A.9.2 3D Studio Material Library Files
A.10 ANIM8OR
A.11 ANIMATOR
A.11.1 Animator Cel
A.11.2 Animator Pro Fli
A.11.3 Animator Pro fLC
A.11.4 Animator Pro PIC
A.12 AUDIO MANAGER MODULE
A.13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY)
A.14 CALIGARI TRUESPACE SCENEOBJECT (BINARY)
A.15 CORELDRAW VECTOR GRAPHICS FILE
A.16 COREL METAFILE EXCHANGE IMAGE
A.17 MUSIC MODULES
A.17.1 DigiTrekker Module Format
A.17.2 Graoumf Tracker modules
A.17.3 Oktalyzer
A.18 EA MULTIMEDIA GAME FILE FORMATS
A.18.1 Electronic Arts Music (asf)
A.18.2 Electronic Arts CMV
A.18.3 Electronic Arts DCT
A.18.4 Electronic Arts MAD
A.18.5 Electronic Arts MPC
A.18.6 Electronic Arts VP7
A.18.7 Electronic Arts TGV
A.18.8 Electronic Arts TGQ
A.18.9 Electronic Arts TQI
A.19 IMAGINE 3D DATA DESCRIPTION (TDDD)
A.20 LIGHTWAVE 3D OBJECTS
A.20.1 LWOB
A.20.2 LWO2
A.20.3 LWLO
A.21 LUXOLOGY MODO 3D OBJECT (LXOB)
A.22 PAINT SHOP PRO (PSP)
A.23 PROVECTOR 2D DRAWING
A.24 PROWRITE DOCUMENT
A.25 INTERACTIVE FICTION
A.25.1 Blorb Z-machine Resource File
A.25.2 Quetzal
A.26 APPLE QUICKTIME
A.27 MAYA IFF BITMAP
A.28 APPLE CORE AUDIO FORMAT
A.29 AMERICAN LASER GAMES (MM)
A.30 SMUSH ANIMATION FORMAT
A.30.1 SAN
A.30.2 SNM
A.31 STRUCTURED DATA EXCHANGE FORMAT (SDXF)
A.32 OGG VORBIS
A.33 SIMPLE VECTOR ARRAY FORMAT
A.34 NESTED LIST FILE
A.35 CINEMA 4D
A.36 THE SIMS INTERCHANGE FILE FORMAT
List of Figures
Figure 1. File Format (Layout) of a RIFF Container for a WebP Image
Figure 2. ILBM File hampic Displayed
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4. A Parse Tree for the ILBM File hampic
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
Figure 6. An ANTLR Grammar for the ILBM File Format
Figure 7. A Java Application for Parsing an ILBM File
Figure 8. Output of the ILBM Parser Applied to the File BLKIFF
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
1 Introduction
1.1 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives. Automated tools are also needed for
viewing/playing text and binary file formats and for converting legacy and obsolete file formats
to standard or current formats. Technologies such as data description languages have emerged to
address these preservation challenges [Dunckley et al 2007].
Context-free grammars have been used to specify the syntax of programming languages.
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar and for addressing the mildly context-sensitive features of
programming languages, such as agreement of the data types of variables in expressions with data
type declarations. Such grammars have parsing/translation algorithms that can be used for
syntax-checking and interpretation or translation of the languages the grammars define.
1.2 Purpose
The question addressed by this research is whether it is possible to extend the context-free
grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the formats of
binary files.
1.3 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats. In section 3, extensions to the concepts of context-free
grammars and attribute grammars are described that enable the specification of binary file
formats. In section 4, examples of attribute array grammars for chunk-based binary file formats
are presented. In section 5, recursive descent parsers for these classes of grammars are described.
In section 6, experience in using ANTLR, a parser generator for LL(*) string grammars, in
generating parsers for recognizing the formats of binary files is discussed. In section 7, related
research is described. Section 8 summarizes the results.
2 Binary File Formats
Traditionally, a file (or record) layout was a form showing how fields were positioned in a file
(or record). The fields are named and have as attributes a data type, a length, and sometimes a
constant value. Fields are offset at addresses relative to the beginning of a file. File formats are
still specified in this manner. For instance, Figure 1 shows the format layout of a RIFF container
for the VP8 codec of a WebP image [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1. File Format (Layout) of a RIFF Container for a WebP Image
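A fixed-offset layout such as the one in Figure 1 maps directly onto a single unpacking step. The sketch below is a minimal Python illustration under two assumptions: the size fields are 32-bit little-endian integers, as in RIFF, and the file uses exactly the simple layout shown. The function name parse_webp_header is hypothetical.

```python
import struct


def parse_webp_header(data: bytes) -> dict:
    """Check the tags at offsets 0, 8 and 12 and return the two size fields."""
    if len(data) < 20:
        raise ValueError("file too short for the 20-byte layout")
    riff, riff_size, form, codec, vp8_size = struct.unpack("<4sI4s4sI", data[:20])
    if riff != b"RIFF":
        raise ValueError("offset 0: expected RIFF tag")
    if form != b"WEBP":
        raise ValueError("offset 8: expected WEBP form-type signature")
    if codec != b"VP8 ":
        raise ValueError("offset 12: expected 'VP8 ' tag")
    if 20 + vp8_size > len(data):
        raise ValueError("offset 16: VP8 data size exceeds the file length")
    return {
        "riff_size": riff_size,              # size of the data starting at offset 8
        "vp8_size": vp8_size,                # size of the data starting at offset 20
        "vp8_data": data[20:20 + vp8_size],  # the VP8 image data
    }
```

The last check is the kind of size-versus-length relationship that a layout table states only in prose.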
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human-readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not expressible in the EBNF (or regular
expression) notation. These details are often the context-sensitive relationships of the declared
size of an array to its actual length, or the relationship of an address pointer to the actual
location of the data pointed to in the file. These context-sensitive relationships are not
expressible in a context-free grammar (or EBNF notation). It is a goal of this research to extend
the EBNF notation and the concept of a context-free grammar to include the details that are
necessary to precisely specify binary file formats that include these context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats. There
are two families of binary file formats that are readily distinguished: the chunk-based and the
directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. An IFF file itself is one entire IFF chunk. A
chunk consists of an ID tag, a size, and size bytes of data. The data may include chunks called
subchunks. Subchunks have the same structure as chunks. Subchunks can have subchunks.
Chunk-based file formats are primarily used as containers for multimedia but have also been
used for word processing files including pictures and other figures. File format specifications for
chunk-based formats may use terms other than chunks in describing the format, for instance,
atoms in QuickTime/MP4, segments in JPEG, and "tagged data representations" in
AutoCAD DXF.
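The recursive chunk structure (an ID tag, a size, size bytes of data, possibly containing subchunks) can be sketched as a small recursive reader. The Python code below is an illustration under stated assumptions, not the report's implementation: it assumes IFF conventions (big-endian 32-bit sizes, container chunks "FORM", "LIST" and "CAT " whose data begins with a 4-byte type, and a pad byte after odd-sized chunks), and the names Chunk and read_chunks are hypothetical.

```python
import struct
from dataclasses import dataclass, field

# Assumption: these IFF chunk IDs contain a 4-byte type followed by subchunks.
CONTAINER_IDS = {b"FORM", b"LIST", b"CAT "}


@dataclass
class Chunk:
    ckid: bytes
    size: int
    form_type: bytes = b""
    data: bytes = b""
    children: list = field(default_factory=list)


def read_chunks(buf: bytes, start: int = 0, end=None) -> list:
    """Read consecutive chunks in buf[start:end], descending into containers."""
    end = len(buf) if end is None else end
    chunks = []
    pos = start
    while pos + 8 <= end:
        ckid, size = struct.unpack(">4sI", buf[pos:pos + 8])
        body_start = pos + 8
        body_end = body_start + size
        if body_end > end:
            raise ValueError(f"chunk {ckid!r} overruns its container")
        if ckid in CONTAINER_IDS:
            form_type = buf[body_start:body_start + 4]
            children = read_chunks(buf, body_start + 4, body_end)
            chunks.append(Chunk(ckid, size, form_type, b"", children))
        else:
            chunks.append(Chunk(ckid, size, data=buf[body_start:body_end]))
        pos = body_end + (size & 1)  # skip the pad byte after an odd-sized chunk
    return chunks
```

The overrun check compares each declared size with the space left in the enclosing chunk, which is the context-sensitive constraint that plain EBNF cannot state.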
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format consists of
one or more directory tables, which contain one or more directory entries. A directory entry
specifies where the actual data for a type of information is located. This is the scheme used in
TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument, and
Microsoft Open Office files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar $G$ is a quadruple $\langle N, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols,
$\Sigma$ is a finite set of terminal symbols,
$S \in N$ is the start symbol, and
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$.
The set of all strings over a vocabulary $\Sigma$ is denoted by $\Sigma^{*}$. The language generated
by a context-free grammar $G$ is the set $L(G) = \{ w \mid w \in \Sigma^{*} \text{ and } S \Rightarrow^{*} w \}$.
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i has the
address 4 × i.
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) are one-dimensional arrays of data types whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, ptr. We define a binary file to be an array
of values of data of various types.
We define a context-free binary file grammar $BG$ as a quintuple $\langle N, D, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols,
$D$ is a set of data types,
$\Sigma$ is a finite set of binary values of data types $D$, called terminals,
$S \in N$ is the start symbol, and
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$.
Let DataTypes indicate the union of all the values of all data types $D$. The set of all binary files is
$\mathit{BinaryFiles} = [\mathit{Indices} \rightarrow \mathit{DataTypes}]$, where $\mathit{Indices} = \{1, \ldots, n\}$ for a file of $n$ values. The
language generated by a context-free array (binary file) grammar $BG$ is the set
$L(BG) = \{ w \mid w \in \mathit{BinaryFiles} \text{ and } S \Rightarrow^{*} w \}$.
By primitive nonterminal is meant a symbol which would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions which only have one or more
terminal symbols on the right. We adopt the notation shown below for primitive nonterminals:
<primitive_nonterminal_symbol type=datatype_name constant=value>
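One way to read the primitive nonterminal notation is as an instruction to a parser: read one value of the named data type at the current address and, if a constant is given, require the value to equal it. The Python sketch below illustrates that reading. The type table, its big-endian byte order, and the names Primitive and match_primitive are assumptions made for the example; the notation itself does not fix a byte order.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from data type names to struct formats (big-endian).
FORMATS = {
    "BYTE": ">B",
    "INT16": ">h",
    "UINT16": ">H",
    "INT32": ">i",
    "UINT32": ">I",
}


@dataclass
class Primitive:
    name: str
    type: str
    constant: Optional[int] = None


def match_primitive(p: Primitive, buf: bytes, pos: int):
    """Read one value of p.type at pos; return (value, next_pos)."""
    fmt = FORMATS[p.type]
    size = struct.calcsize(fmt)
    if pos + size > len(buf):
        raise ValueError(f"{p.name}: unexpected end of file")
    (value,) = struct.unpack_from(fmt, buf, pos)
    if p.constant is not None and value != p.constant:
        raise ValueError(f"{p.name}: expected {p.constant}, found {value}")
    return value, pos + size
```

Since the data type determines how many bytes to consume, no separate lexer is needed to delimit the value.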
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used, or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1968, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar $GA$ is a triple $\langle G, A, AR \rangle$, where $G$ is a context-free grammar for the
language, $A$ associates each grammar symbol $X \in (N \cup \Sigma)$ with a set of attributes, and $AR$
associates each production $R \in P$ with a set of attribute computation rules. $A(X)$, where
$X \in (N \cup \Sigma)$, can be further partitioned into two sets: synthesized attributes $S(X)$ and inherited
attributes $I(X)$. $AR(R)$, where $R \in P$, contains rules for computing inherited and synthesized attributes
associated with the symbols in the production $R$. Synthesized attributes are evaluated and passed
up from deeper terminals and non-terminals, while inherited attributes are passed down from
upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes, and
ii) inherited attributes, where each inherited attribute of a child node depends only on
a) attributes of the other child nodes to its left, and
b) inherited attributes of the parent node.
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
3.3 Semantic Grammar
An interpreter for a binary file format that is using a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialogue systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993, Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
SHOW      FLIGHTS   ORIGIN        DESTINATION        DEPART_DATE   DEPART_TIME
Show me   flights   from Boston   to San Francisco   on Tuesday    morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats, but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammars that we present are a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
4.1 Interleaved Bitmap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986]. In this section, the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. As described above, this attribute grammar mixes synthesized
and inherited rules. It is also an L-attributed grammar. Figure 4 shows a parse tree for the file
hampic.
Figure 2. ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
         ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]
         n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
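The CMAP rule illustrates how an attribute computed from a symbol on the left (the cksize value) controls the parsing of symbols to its right (the number of color entries). The Python sketch below is one possible reading of that rule, not the report's parser: the function name parse_cmap is hypothetical, the requirement that cksize be a multiple of 3 is an assumption implied by n = cksize / 3, and the pad byte after an odd-sized chunk follows the IFF convention.

```python
import struct


def parse_cmap(buf: bytes, pos: int = 0):
    """Parse <CMAP> at pos; return (list of (red, green, blue), next_pos)."""
    if pos + 8 > len(buf):
        raise ValueError("unexpected end of file in CMAP header")
    ckid, cksize = struct.unpack_from(">4sI", buf, pos)
    if ckid != b"CMAP":
        raise ValueError("expected CMAP chunk ID")
    if cksize % 3 != 0:
        raise ValueError("CMAP cksize is not a multiple of 3")
    n = cksize // 3  # attribute rule: n = cksize.value / 3
    body = pos + 8
    if body + cksize > len(buf):
        raise ValueError("CMAP cksize exceeds the file length")
    colors = [tuple(buf[body + 3 * i: body + 3 * i + 3]) for i in range(n)]
    return colors, body + cksize + (cksize & 1)
```

Because cksize is read before the color array, the rule can be evaluated in a single left-to-right pass, which is what makes the grammar L-attributed.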
Figure 4. A Parse Tree for the ILBM File hampic
42 RIFF Waveform PCM Audio File Format (WAVE)
The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp
Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF
rules with the cksize inserted To convert this grammar into a binary file format grammar data
types are added for primitive non-terminals
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' <data-ck> | <silence-ck>
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
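The effect of adding data types to the primitive non-terminals can be illustrated with a small Java sketch for the fmt chunk rule of Figure 5. This is not the authors' code: the class name FmtChunkSketch and its helper methods are hypothetical, and the sketch assumes that WORD and DWORD are unsigned little-endian integers of 16 and 32 bits and that a PCM fmt chunk carries at least the 16 bytes of fields listed in the grammar.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical recognizer for the fmt chunk rule of Figure 5 (illustrative only). */
final class FmtChunkSketch {

    /** Reads a four-character code such as 'fmt '. */
    static String fourCC(ByteBuffer in) {
        byte[] id = new byte[4];
        in.get(id);
        return new String(id, StandardCharsets.ISO_8859_1);
    }

    /** WORD: assumed to be an unsigned 16-bit little-endian integer. */
    static int word(ByteBuffer in) {
        return in.getShort() & 0xFFFF;
    }

    /** DWORD: assumed to be an unsigned 32-bit little-endian integer. */
    static long dword(ByteBuffer in) {
        return in.getInt() & 0xFFFFFFFFL;
    }

    /**
     * fmt-ck -> 'fmt ' cksize common-fields wBitsPerSample:WORD
     * Returns wBitsPerSample, or throws if the bytes do not match the rule.
     */
    static int fmtChunk(ByteBuffer in) {
        in.order(ByteOrder.LITTLE_ENDIAN);
        if (!fourCC(in).equals("fmt ")) {
            throw new IllegalArgumentException("expected chunk id 'fmt '");
        }
        long ckSize = dword(in);
        if (ckSize < 16) {                    // the typed fields below occupy 16 bytes
            throw new IllegalArgumentException("cksize too small: " + ckSize);
        }
        int formatTag = word(in);             // wFormatTag:WORD value=0x0001
        if (formatTag != 0x0001) {
            throw new IllegalArgumentException("not PCM: " + formatTag);
        }
        int channels = word(in);              // wChannels:WORD
        long samplesPerSec = dword(in);       // dwSamplesPerSec:DWORD
        long avgBytesPerSec = dword(in);      // dwAvgBytesPerSec:DWORD
        int blockAlign = word(in);            // wBlockAlign:WORD
        return word(in);                      // wBitsPerSample:WORD
    }
}
```

Each typed non-terminal maps to exactly one read of a known width, which is why no separate tokenizing step is needed for this rule.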
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
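A minimal Java sketch of this idea follows, using the CAMG and BODY rules of Figure 3. It is an illustration under stated assumptions, not the authors' implementation: the class ChunkRecognizerSketch and its method names are hypothetical, and the input is assumed to be held in memory as a big-endian byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical predictive recognizer: one procedure per production rule. */
final class ChunkRecognizerSketch {
    private final ByteBuffer in;   // ByteBuffer is big-endian by default

    ChunkRecognizerSketch(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** Terminal: a four-byte chunk identifier such as "BODY". */
    private void expectId(String id) {
        byte[] b = new byte[4];
        in.get(b);
        String found = new String(b, StandardCharsets.ISO_8859_1);
        if (!id.equals(found)) {
            throw new IllegalStateException("expected " + id + ", found " + found);
        }
    }

    /** cksize type=UINT32: the rule predicts four bytes, so no lexer is needed. */
    private long ckSize() {
        return in.getInt() & 0xFFFFFFFFL;
    }

    /** CAMG -> "CAMG" cksize viewmode(INT32); returns the view mode. */
    int camg() {
        expectId("CAMG");
        ckSize();
        return in.getInt();
    }

    /** BODY -> "BODY" cksize data(BYTE)[cksize]; returns the number of data bytes. */
    long body() {
        expectId("BODY");
        long size = ckSize();
        if (size > in.remaining()) {
            throw new IllegalStateException("BODY data shorter than cksize");
        }
        in.position(in.position() + (int) size);   // consume exactly cksize bytes
        return size;
    }
}
```

The body procedure also shows how an attribute value read earlier in a rule (cksize) determines how much input the rest of the rule consumes.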
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
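One possible reading of this stack discipline is sketched below in Java. The record layout is invented for illustration only (a 32-bit offset that points at a 16-bit value, followed by a 16-bit tag), and the class PointerStackSketch and its methods are hypothetical names rather than part of the system described in this paper.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of following a file pointer with a pushdown stack. */
final class PointerStackSketch {
    private final ByteBuffer in;
    private final Deque<Integer> returnAddresses = new ArrayDeque<>();

    PointerStackSketch(byte[] file) {
        this.in = ByteBuffer.wrap(file);   // big-endian by default
    }

    /**
     * Illustrative rule: entry -> offset(UINT32) tag(UINT16),
     * where offset is the file address of a value(UINT16).
     * Returns { value, tag }.
     */
    int[] entry() {
        int offset = in.getInt();
        returnAddresses.push(in.position());   // remember where parsing resumes
        in.position(offset);                   // jump to the pointed-to location
        int value = pointedValue();            // parse the structure found there
        in.position(returnAddresses.pop());    // resume after the pointer
        int tag = in.getShort() & 0xFFFF;
        return new int[] { value, tag };
    }

    /** Procedure for the nonterminal located at the pointed-to address. */
    private int pointedValue() {
        return in.getShort() & 0xFFFF;
    }
}
```

Because addresses are pushed and popped in last-in, first-out order, the same mechanism would also cover pointed-to structures that themselves contain pointers.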
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
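The conversion from character tokens to binary data types can be sketched in plain Java as follows. This is a simplified stand-in for the conversion functions described above, not the ANTLR actions themselves: the class ByteTokenSketch and its methods are hypothetical, the token sequence is modeled as a String decoded with ISO-8859-1 so that each character carries exactly one byte, and big-endian order is assumed, as in ILBM.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helpers that turn one-byte character tokens into integers. */
final class ByteTokenSketch {

    /** Decodes raw bytes so that each char holds one byte value (0..255). */
    static String asTokens(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** One character token to an unsigned byte value. */
    static int uByte(char token) {
        return token & 0x00FF;
    }

    /** Two tokens to an unsigned 16-bit value (big-endian). */
    static int uInt16(String tokens, int at) {
        return (uByte(tokens.charAt(at)) << 8) | uByte(tokens.charAt(at + 1));
    }

    /** Two tokens to a signed 16-bit value (big-endian). */
    static int int16(String tokens, int at) {
        return (short) uInt16(tokens, at);
    }

    /** Four tokens to a signed 32-bit value (big-endian). */
    static int int32(String tokens, int at) {
        return (uInt16(tokens, at) << 16) | uInt16(tokens, at + 2);
    }
}
```

Decoding with ISO-8859-1 matters here: it is a one-to-one mapping between byte values and the first 256 character codes, so no byte is altered or merged when the file is read as characters.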
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
  language = Java;              // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }
}
// Grammar rules used to create the parser

// Start symbol: iblm
iblm returns [boolean result]
  : FORM ckSize ILBM
    {
      // Print out ILBM chunk size
      System.out.println("ID ILBM Size " + $ckSize.value);
    }
    // Semantic predicate that ensures that at least one BMHD is found
    propertyChunks { $propertyChunks.foundBMHD == true }?
    dataChunks*   // data chunks are optional
    body
    { $result = true; }
  ;

// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
  : ( bmhd { $foundBMHD = true; }
    | cmap
    | camg
    )+
  ;

bmhd
  : BMHD ckSize
    { System.out.println("ID BMHD Size " + $ckSize.value); }
    bitMapHeader
  ;
bitMapHeader
  : width = uWordValue
    height = uWordValue
    xPosition = wordValue
    yPosition = wordValue
    nPanes = uByteValue
    masking = uByteValue
    compression = uByteValue
    reserved1 = uByteValue
    transparentColor = uWordValue
    xAspect = uByteValue
    yAspect = uByteValue
    pageWidth = wordValue
    pageHeight = wordValue
    {
      // Print out values in the bitmap structure
      System.out.println(" Width " + $width.value);
      System.out.println(" Height " + $height.value);
      System.out.println(" X Position " + $xPosition.value);
      System.out.println(" Y Position " + $yPosition.value);
      System.out.println(" Number of Panes " + $nPanes.value);
      System.out.println(" Masking " + $masking.value);
      System.out.println(" Compression " + $compression.value);
      System.out.println(" Reserved1 " + $reserved1.value);
      System.out.println(" TransParent Color " +
                         $transparentColor.value);
      System.out.println(" X Aspest " + $xAspect.value);
      System.out.println(" Y Aspect " + $yAspect.value);
      System.out.println(" Page Width " + $pageWidth.value);
      System.out.println(" Page Height " + $pageHeight.value);
    }
  ;
cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
  : CMAP ckSize
    {
      // Print out chunk size
      $value = $ckSize.value;
      System.out.println("ID CMAP Size " + $ckSize.value);
    }
    ( colorRegister
      {
        // Print out color values
        System.out.println(" Color[" + i++ + "] " +
                           $colorRegister.redValue +
                           " " + $colorRegister.greenValue +
                           " " + $colorRegister.blueValue);
      }
    )+
  ;

colorRegister returns [int redValue, int greenValue, int blueValue]
  : red = uByteValue green = uByteValue blue = uByteValue
    {
      $redValue = $red.value;
      $greenValue = $green.value;
      $blueValue = $blue.value;
    }
  ;

camg
  : CAMG ckSize longValue
    {
      System.out.println("ID CAMG Size " + $ckSize.value);
      System.out.println(" View Modes 0x" +
                         Long.toHexString($longValue.value));
    }
  ;
// The crng and ccrt data chunks
dataChunks
  : crng
  | ccrt
  ;

crng
  : CRNG ckSize
    { System.out.println("ID CRNG Size " + $ckSize.value); }
    cRange
  ;

cRange
  : pad1 = wordValue { $pad1.value == 0 }?
    rate = wordValue
    active = wordValue
    low = uByteValue
    high = uByteValue
  ;

ccrt
  : CCRT ckSize
    { System.out.println("ID CCRT Size " + $ckSize.value); }
    cycleInfo
  ;

cycleInfo
  : direction = wordValue
    start = uByteValue
    end = uByteValue
    seconds = longValue
    microseconds = longValue
    pad = wordValue { $pad.value == 0 }?
  ;

body
  : BODY ckSize
    { System.out.println("ID BODY Size " + $ckSize.value); }
    { $ckSize.value > 0 }? data
    { System.out.println(" Number of bytes " + $data.value); }
    { $data.value == $ckSize.value }?
  ;

// Defines the data of the body chunk
data returns [long value]
  : (b += BYTE)+ { $b.size() > 0 }?
    { $value = $b.size(); }
  ;
// The ckSize is the value of a long integer
ckSize returns [long value]
  : longValue
    { $value = $longValue.value; }
  ;

// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
  : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
    {
      $value = (long)
        ( getUByteValue($h1) << 24 |
          getUByteValue($h2) << 16 |
          getUByteValue($l1) << 8 |
          getUByteValue($l2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Defines the unsigned word data type
uWordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (int)
        ( getUByteValue($b1) << 8 |
          getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (short)
        ( getUByteValue($b1) << 8 |
          getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Defines the unsigned byte data type
uByteValue returns [int value]
  : BYTE
    {
      // Calls getUByteValue on the BYTE token
      $value = (int) getUByteValue($BYTE);
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6. An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;

import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that each byte becomes one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
  language = Java;              // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }

  // Default byte order is little-endian
  String byteOrder = "II";
}
// Rules used to create the parser

// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
  : ckId = fourCC { $ckId.value.equals("RIFF") }?
    ckSize = dWordValue
    formType = fourCC { $formType.value.equals("WAVE") }?
    formatSubChunk
    dataSubChunk
    { $result = true; }
  ;

fourCC returns [String value]
  : BYTE BYTE BYTE BYTE { $value = $text; }
  ;

formatSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
    ckSize = dWordValue
    commonFields
    pcmSpecificField
  ;

commonFields
  : formatTag = wordValue         // Format category
    channels = wordValue          // Number of channels
    samplesPerSec = dWordValue    // Sampling rate
    avgBytesPerSec = dWordValue   // For buffer estimation
    blockAlign = wordValue        // Data block size
  ;

pcmSpecificField
  : bitsPerSample = wordValue     // Sample size
  ;

dataSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("data") }?
    ckSize = dWordValue
    waveData[$ckSize.value]
  ;

// The gated predicate uses size > 0 so that exactly size bytes are consumed
waveData [long size]
  : ( { size > 0 }?=> BYTE { size--; } )+
  ;
// Defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
  : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (long)
          ( getUByteValue($b1) |
            getUByteValue($b2) << 8 |
            getUByteValue($b3) << 16 |
            getUByteValue($b4) << 24 );
      } else {
        $value = (long)
          ( getUByteValue($b1) << 24 |
            getUByteValue($b2) << 16 |
            getUByteValue($b3) << 8 |
            getUByteValue($b4) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Defines the word (16-bit) data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (int)
          ( getUByteValue($b1) |
            getUByteValue($b2) << 8 );
      } else {
        $value = (int)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Terminal rule used to create the lexer
BYTE : . ;
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with a pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
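A file signature test for the two formats treated in this paper can be sketched in a few lines of Java. This is only an illustration of the idea and not the magic file referred to above: the class MagicTestSketch and the method identify are hypothetical names, and the test relies solely on the leading identifiers that appear in the ANTLR grammars of Figures 6 and 9 (FORM, cksize, ILBM and RIFF, cksize, WAVE).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical signature test based on the first 12 bytes of a file. */
final class MagicTestSketch {

    /** Returns "ILBM", "WAVE", or "unknown". */
    static String identify(byte[] head) {
        if (head.length < 12) {
            return "unknown";
        }
        // Bytes 0-3 hold the container id, bytes 4-7 the chunk size,
        // and bytes 8-11 the form type.
        String container = new String(head, 0, 4, StandardCharsets.ISO_8859_1);
        String formType = new String(head, 8, 4, StandardCharsets.ISO_8859_1);
        if (container.equals("FORM") && formType.equals("ILBM")) {
            return "ILBM";
        }
        if (container.equals("RIFF") && formType.equals("WAVE")) {
            return "WAVE";
        }
        return "unknown";
    }
}
```

A signature test of this kind only selects which grammar to apply; conformance of the whole file is still established by the recognizer generated from that grammar.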
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G. Back (2002). A specification and scripting language for binary data. Generative
Programming and Component Engineering, Vol. 2487, Lecture Notes in Computer Science, 66-77.
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library. JHOVE2 User's Guide. 2011.
https://bytebucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide_20110222.pdf
[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010). The Data Description
Language EAST Specification. http://public.ccsds.org/publications/archive/644x0b3.pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher, K. & Gruber, R. (2005). PADS: A domain specific language for
processing ad hoc data. ACM Conference on Programming Language Design and
Implementation, ACM Press. http://www.padsproj.org/papers/pldi.pdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft. Multimedia Programming Interface and Data
Specifications 1.0. August 1991. www.kk.iij4u.or.jp/~kondo/wave/mpidata.txt
www.tactilemedia.com/info/MCI_Control_Info.html
http://www-mmsp.ece.mcgill.ca/documents/audioformats/wave/Docs/riffmci.pdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding
system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127_145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95_96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr, T. J. & Quong, R. W. (1995). ANTLR: A predicated-LL(k) parser
generator. Software Practice and Experience 25, 7 (July), 789-810.
http://www.antlr.org/article/1055550346383/antlr.pdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel File
Format: Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003. OpenOffice.org, April 2008.
http://sc.openoffice.org/excelfileformat.pdf
[REWiki 2009] reWiki. IFF. http://rewiki.regengedanken.de/wiki/IFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W. Ward and S. Issar. Recent improvements in the CMU spoken language
understanding system. ARPA Human Language Technologies Workshop, Plainsboro, New
Jersey. http://acl.ldc.upenn.edu/H/H94/H94-1039.pdf
[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.
Automata, Languages and Programming, Lecture Notes in Computer Science 14, J. Loeckx (ed.),
Springer-Verlag, Berlin, 1974, pp. 9-16.
[Wilhelm and Maurer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995 (chapter 10, Attribute Grammars).
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A. Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$$
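The sample count can be checked with a few lines of Python. This is only an illustrative sketch; the function name `body_sample_count` is hypothetical, and the arguments simply mirror the Voice8Header fields named above.

```python
def body_sample_count(ct_octave, one_shot_hi_samples, repeat_hi_samples):
    """Total samples in the BODY chunk of an 8SVX file (hypothetical helper).

    The highest octave holds (oneShotHiSamples + repeatHiSamples) samples and
    each lower octave holds twice as many as the one before it.
    """
    highest_octave = one_shot_hi_samples + repeat_hi_samples
    # Sum 2^0 + 2^1 + ... + 2^(ctOctave-1), each scaled by the highest octave.
    return sum((2 ** k) * highest_octave for k in range(ct_octave))
```

Because the octave sizes form a geometric series, the same value can be written as (2^ctOctave - 1) times the size of the highest octave.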
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The <length of header data> is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
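As an illustration of how the header-chunk rule maps onto bytes, the following Python sketch reads the "MThd" chunk. The function name `parse_midi_header` is hypothetical. It assumes the big-endian byte order and 32-bit length field of Standard MIDI Files and, following the text above, rejects any header length other than 6.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Parse <headerchunk> from the start of a Standard MIDI File (sketch)."""
    if len(data) < 14:
        raise ValueError("too short for an MThd chunk")
    # "MThd" tag followed by a 32-bit big-endian length of the header data.
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("unexpected header length %d" % length)
    # <header data> -> <format> <ntrks> <division>, three 16-bit integers.
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}
```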
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. Apple [1991] also specified a compressed version, AIFC, of the Audio
Interchange File Format. The following recasts the AIFF specification as a context-free
pseudo-EBNF grammar.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these
chunks is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP
typically achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of
image quality. A WebP file consists of VP8 image data and a container based on RIFF. The format
of a WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
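The segment layout above can be walked with a short loop. The following Python sketch is illustrative only; `iter_jpeg_segments` is a hypothetical name. It assumes every segment before the image data carries the two-byte size field described above, and it stops at the start-of-scan marker ($da), because the entropy-coded data that follows is not organized as sized segments. That stopping rule is a simplification made here, not something stated in the text.

```python
import struct


def iter_jpeg_segments(data: bytes):
    """Yield (type_byte, payload) for each sized segment after SOI (sketch)."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        seg_type = data[pos + 1]
        if seg_type == 0xD9:  # EOI trailer: nothing more to read
            return
        # sh, sl: big-endian size that counts itself but not $ff and the type.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        yield seg_type, data[pos + 4:pos + 2 + size]
        if seg_type == 0xDA:  # SOS: compressed image data follows
            return
        pos += 2 + size
```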
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-9.1
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entry -> red green blue
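For illustration, the header chunk can be read as in the Python sketch below. The name `parse_ihdr` is hypothetical. It relies on the chunk layout given in the PNG specification (a 4-byte big-endian length, the 4-byte chunk type, the data, then a CRC-32 computed over the type and data), so the length precedes the "IHDR" tag on disk.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_FIELDS = ("width", "height", "bit_depth", "colour_type",
               "compression_method", "filter_method", "interlace_method")


def parse_ihdr(data: bytes) -> dict:
    """Validate the PNG signature and decode the IHDR chunk (sketch)."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG signature")
    (length,) = struct.unpack_from(">I", data, 8)
    chunk_type = data[12:16]
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a 13-byte IHDR")
    body = data[16:16 + length]
    (crc,) = struct.unpack_from(">I", data, 16 + length)
    # The CRC covers the chunk type and the chunk data, not the length field.
    if zlib.crc32(chunk_type + body) & 0xFFFFFFFF != crc:
        raise ValueError("IHDR CRC mismatch")
    return dict(zip(IHDR_FIELDS, struct.unpack(">IIBBBBB", body)))
```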
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
A.11.1 Animator Cel
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
A.4.5 Windows Audio-Video Interleave (AVI)
A.4.6 Animated Windows Cursor (ANI)
A.4.7 RIFF MIDIfile (RMID)
A.4.8 Device-Independent Bitmap (RIFF RDIB)
A.4.9 WebP
A.5 JPEG
A.5.1 JPEG/JFIF
A.5.2 JPEG/Exif
A.5.3 JPEG 2000
A.5.4 JPEG 2000 Codestream
A.6 ADVANCED SYSTEMS FORMAT
A.6.1 Windows Media Audio 7-9
A.6.2 Windows Media Video 7-9.1
A.7 PORTABLE NETWORK GRAPHICS
A.7.1 Multiple-image Network Graphics
A.7.2 JPEG Network Graphics
A.8 BINARY INTERCHANGE FILE FORMAT
A.9 3D STUDIO
A.9.1 3D Studio File Format - 3ds
A.9.2 3D Studio Material Library Files
A.10 ANIM8OR
A.11 ANIMATOR
A.11.1 Animator Cel
A.11.2 Animator Pro Fli
A.11.3 Animator Pro fLC
A.11.1 Animator Cel
A.11.4 Animator Pro PIC
A.12 AUDIO MANAGER MODULE
A.13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY)
A.14 CALIGARI TRUESPACE SCENE/OBJECT (BINARY)
A.15 CORELDRAW VECTOR GRAPHICS FILE
A.16 COREL METAFILE EXCHANGE IMAGE
A.17 MUSIC MODULES
A.17.1 DigiTrekker Module Format
A.17.2 Graoumf Tracker modules
A.17.3 Oktalyzer
A.18 EA MULTIMEDIA GAME FILE FORMATS
A.18.1 Electronic Arts Music (asf)
A.18.2 Electronic Arts CMV
A.18.3 Electronic Arts DCT
A.18.4 Electronic Arts MAD
A.18.5 Electronic Arts MPC
A.18.6 Electronic Arts VP6
A.18.7 Electronic Arts TGV
A.18.8 Electronic Arts TGQ
A.18.9 Electronic Arts TQI
A.19 IMAGINE 3D DATA DESCRIPTION (TDDD)
A.20 LIGHTWAVE 3D OBJECTS
A.20.1 LWOB
A.20.2 LWO2
A.20.3 LWLO
A.21 LUXOLOGY MODO 3D OBJECT (LXOB)
A.22 PAINT SHOP PRO (PSP)
A.23 PROVECTOR 2D DRAWING
A.24 PROWRITE DOCUMENT
A.25 INTERACTIVE FICTION
A.25.1 Blorb Z-machine Resource File
A.25.2 Quetzal
A.26 APPLE QUICKTIME
A.27 MAYA IFF BITMAP
A.28 APPLE CORE AUDIO FORMAT
A.29 AMERICAN LASER GAMES (MM)
A.30 SMUSH ANIMATION FORMAT
A.30.1 SAN
A.30.2 SNM
A.31 STRUCTURED DATA EXCHANGE FORMAT (SDXF)
A.32 OGG VORBIS
A.33 SIMPLE VECTOR ARRAY FORMAT
A.34 NESTED LIST FILE
A.35 CINEMA 4D
A.36 THE SIMS INTERCHANGE FILE FORMAT
List of Figures
Figure 1. File format (Layout) of a RIFF container for a WebP Image
Figure 2. ILBM File hampic Displayed
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4. A Parse Tree for the ILBM File hampic
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
Figure 6. An ANTLR Grammar for the ILBM File Format
Figure 7. A Java Application for Parsing an ILBM File
Figure 8. Output of the ILBM Parser Applied to the file BLKIFF
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
1 Introduction
1.1 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives. Automated tools are also needed for
viewing/playing text and binary file formats, and for converting legacy and obsolete file formats
to standard or current formats. Technologies such as data description languages have emerged to
address these preservation challenges [Dunckley et al 2007].
Context-free grammars have been used to specify the syntax of programming languages.
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar, and for addressing the mildly context-sensitive features of
programming languages, such as agreement of the data types of variables in expressions with data
type declarations. Such grammars have parsing/translation algorithms that can be used for
syntax-checking and interpretation or translation of the languages the grammars define.
1.2 Purpose
The question addressed by this research is whether it is possible to extend the context-free
grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers to validate the formats of
binary files.
1.3 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats. In section 3, extensions to the concepts of context-free
grammars and attribute grammars are described that enable the specification of binary file
formats. In section 4, examples of attribute array grammars for chunk-based binary file formats
are presented. In section 5, recursive descent parsers for these classes of grammars are
described. In section 6, experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files is discussed. In
section 7, related research is described. Section 8 summarizes the results.
2 Binary File Formats
Traditionally, a file (or record) layout was a form showing how fields were positioned in a file
(or record). The fields are named and have as attributes a data type, a length, and sometimes a
constant value. Fields are offset at addresses relative to the beginning of a file. File formats
are still specified in this manner. For instance, Figure 1 shows the format layout of a RIFF
container for the VP8 codec of a WebP image [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1. File format (Layout) of a RIFF container for a WebP Image
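A layout of this kind translates almost directly into offset-based reads. The Python sketch below is illustrative; `parse_webp_header` is a hypothetical name. It assumes the little-endian 32-bit size fields that RIFF uses and a "VP8 " tag whose fourth byte is a space.

```python
import struct


def parse_webp_header(data: bytes):
    """Check the fixed fields of Figure 1 and return the VP8 payload (sketch)."""
    if len(data) < 20:
        raise ValueError("too short for a RIFF/WEBP header")
    # Offsets 0, 4 and 8: "RIFF" tag, container size, "WEBP" form type.
    riff_tag, riff_size, form_type = struct.unpack("<4sI4s", data[:12])
    # Offsets 12 and 16: "VP8 " tag and size of the image data at offset 20.
    vp8_tag, vp8_size = struct.unpack("<4sI", data[12:20])
    if (riff_tag, form_type, vp8_tag) != (b"RIFF", b"WEBP", b"VP8 "):
        raise ValueError("tags do not match the WebP layout")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 size runs past the end of the file")
    return riff_size, vp8_size, data[20:20 + vp8_size]
```

The last check is an example of a context-sensitive constraint: the declared size must agree with the number of bytes actually present.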
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human-readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not actually expressible in the EBNF (or regular
expression) notation. These details are often the context-sensitive relationships of the declared
size of an array to its actual length, or the relationship of an address pointer to the actual
location of the data pointed to in the file. These context-sensitive relationships are not
expressible in a context-free grammar (or EBNF notation). It is a goal of this research to extend
the EBNF notation and the concept of a context-free grammar to include the details that are
necessary to specify precisely binary file formats that include these context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats.
There are two families of binary file formats that are readily distinguished: the chunk-based and
the directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. An IFF file itself is one entire IFF chunk. A
chunk consists of an ID tag, a size, and size bytes of data. The data may include chunks, called
subchunks. Subchunks have the same structure as chunks. Subchunks can have subchunks.
Chunk-based file formats are primarily used as containers for multimedia but have also been
used for word processing files including pictures and other figures. File format specifications
for chunk-based formats may use terms other than chunks in describing the format, for instance
atoms in QuickTime/MP4, segments in JPEG, and "tagged data representations" as in
AutoCAD DXF.
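The chunk structure just described (ID tag, size, then size bytes of data, possibly holding subchunks) can be sketched in Python as follows. The names `read_chunk` and `walk_form` are hypothetical. The code assumes the EA IFF 85 conventions of a 4-byte ID, a big-endian 32-bit size, and a pad byte after odd-sized data; the pad byte comes from that specification rather than from the paragraph above.

```python
import struct


def read_chunk(data: bytes, pos: int = 0):
    """Read one chunk at `pos`; return (id, chunk data, position of next chunk)."""
    ck_id, size = struct.unpack(">4sI", data[pos:pos + 8])
    start, end = pos + 8, pos + 8 + size
    if end > len(data):
        raise ValueError("chunk size runs past the end of the data")
    # Chunks are padded to an even length; the pad byte is not counted in size.
    return ck_id, data[start:end], end + (size & 1)


def walk_form(data: bytes):
    """Return (form type, [(subchunk id, subchunk data), ...]) for a FORM chunk."""
    ck_id, body, _ = read_chunk(data)
    if ck_id != b"FORM":
        raise ValueError("outermost chunk is not a FORM")
    form_type, pos, subchunks = body[:4], 4, []
    while pos < len(body):
        sub_id, sub_data, pos = read_chunk(body, pos)
        subchunks.append((sub_id, sub_data))
    return form_type, subchunks
```

Calling `walk_form` again on the data of a nested FORM subchunk handles subchunks of subchunks, which mirrors the recursive definition in the text.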
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format consists
of one or more directory tables, which contain one or more directory entries. A directory entry
specifies where the actual data for a type of information is located. This is the scheme used in
TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument files, and
Microsoft Office Open XML files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar $G$ is a quadruple $\langle N, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols,
$\Sigma$ is a finite set of terminal symbols,
$S \in N$ is the start symbol, and
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$.
The set of all strings over a vocabulary $\Sigma$ is denoted by $\Sigma^{*}$. The language
generated by a context-free grammar $G$ is the set
$L(G) = \{\, w \mid w \in \Sigma^{*} \text{ and } S \Rightarrow^{*} w \,\}$.
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i
has the address 4 × i.
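The address computation in this example can be written out as a small Python sketch. The helper names are hypothetical, and the big-endian, signed 32-bit encoding is an assumption made only so that the example has something concrete to read.

```python
import struct

WORD_SIZE = 4  # bytes per integer element, as in the example above


def element_address(i: int) -> int:
    """Address of the element with index i in an array that starts at 0."""
    return WORD_SIZE * i


def read_element(buf: bytes, i: int) -> int:
    """Read element i as a big-endian signed 32-bit integer (assumed encoding)."""
    return struct.unpack_from(">i", buf, element_address(i))[0]
```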
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) are one-dimensional arrays of data types whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, ptr. We define a binary file to be an
array of values of data of various types.
We define a context-free binary file grammar $BG$ as a quintuple
$\langle N, D, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols,
$D$ is a set of data types,
$\Sigma$ is a finite set of binary values of data types $D$, called terminals,
$S \in N$ is the start symbol, and
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$.
Let DataTypes indicate the union of all the values of all data types $D$. The set of all binary
files is $\mathrm{BinaryFiles} = [\mathrm{Indices} \rightarrow \mathrm{DataTypes}]$ where
$\mathrm{Indices} = \{1, 2, \ldots\}$. The language generated by a context-free array (binary
file) grammar $BG$ is the set
$L(BG) = \{\, w \mid w \in \mathrm{BinaryFiles} \text{ and } S \Rightarrow^{*} w \,\}$.
By primitive nonterminal is meant a symbol which would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions which have only terminal symbols
(one or more) on the right. We adopt the notation shown below for primitive nonterminals.
<primitive_nonterminal_symbol type=datatype_name constant=value>
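One way to picture a primitive nonterminal is as a small record holding a name, a data type and an optional constant, together with a routine that matches it against the bytes of a file. The Python sketch below is an assumption-laden illustration: the class `Primitive`, the `FORMATS` table and the big-endian encodings are all hypothetical and are not taken from the report.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from data type names to struct formats (big-endian).
FORMATS = {"byte": ">B", "int16": ">h", "uint16": ">H",
           "int32": ">i", "uint32": ">I"}


@dataclass
class Primitive:
    """A primitive nonterminal: <name type=datatype_name constant=value>."""
    name: str
    type: str
    constant: Optional[int] = None

    def match(self, data: bytes, pos: int):
        """Read one value of this type at `pos`; return (value, next position)."""
        fmt = FORMATS[self.type]
        (value,) = struct.unpack_from(fmt, data, pos)
        if self.constant is not None and value != self.constant:
            raise ValueError("%s: expected %r, found %r"
                             % (self.name, self.constant, value))
        return value, pos + struct.calcsize(fmt)
```

A constant-valued primitive behaves like a terminal (the match fails unless the expected value is present), while one without a constant simply consumes a value of its data type.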
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used, or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1968, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar $GA$ is a triple $\langle G, A, AR \rangle$, where $G$ is a context-free
grammar for the language, $A$ associates each grammar symbol $X \in (N \cup \Sigma)$ with a set
of attributes, and $AR$ associates each production $R \in P$ with a set of attribute computation
rules. $A(X)$, where $X \in (N \cup \Sigma)$, can be further partitioned into two sets:
synthesized attributes $S(X)$ and inherited attributes $I(X)$. $AR(R)$, where $R \in P$,
contains rules for computing inherited and synthesized attributes associated with the symbols in
the production $R$. Synthesized attributes are evaluated and passed up from deeper terminals and
non-terminals, while inherited attributes are passed down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes, and
ii) inherited attributes, where the inherited attribute of a node depends only on
a) attributes of the sibling nodes to its left, and
b) inherited attributes of the parent node itself.
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
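The way L-attributed evaluation fits into top-down parsing can be sketched with a single text chunk of the form tag, cksize, CHAR[cksize]. In the Python sketch below, all function names are hypothetical and a big-endian 32-bit size is assumed. The value of cksize is synthesized by the left sibling and then passed down as an inherited attribute to the routine that parses the characters.

```python
import struct


def parse_cksize(data: bytes, pos: int):
    """cksize -> UINT32; synthesizes the attribute `value`."""
    (value,) = struct.unpack_from(">I", data, pos)
    return value, pos + 4


def parse_chars(data: bytes, pos: int, size: int):
    """CHAR[size]; `size` is an inherited attribute supplied by the caller."""
    if pos + size > len(data):
        raise ValueError("declared size exceeds the available data")
    return data[pos:pos + size].decode("ascii"), pos + size


def parse_text_chunk(data: bytes, pos: int = 0, tag: bytes = b"ANNO"):
    """chunk -> tag cksize CHAR[cksize], evaluated strictly left to right."""
    if data[pos:pos + 4] != tag:
        raise ValueError("expected tag %r" % tag)
    cksize, pos = parse_cksize(data, pos + 4)   # left sibling is parsed first
    text, pos = parse_chars(data, pos, cksize)  # its value is inherited here
    # The returned dictionary plays the role of the chunk's synthesized attributes.
    return {"tag": tag, "cksize": cksize, "text": text}, pos
```

Because every inherited value is available before the routine that needs it is called, one left-to-right pass over the input suffices.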
3.3 Semantic Grammar
An interpreter for a binary file format that is using a grammar to display or play the contents
of a file, or to convert it to another file format, needs to produce a semantic representation
that is appropriate to the task. Natural language understanding components of dialogue systems
face a similar need to produce a semantic representation that is appropriate to the dialogue
task. To address this need, some developers of dialogue systems make use of an approach called
semantic grammar [Issar and Ward 1993, Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
SHOW         Show me
FLIGHTS      flights
ORIGIN       from Boston
DESTINATION  to San Francisco
DEPART_DATE  on Tuesday
DEPART_TIME  morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats but just to a single format
or to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the
right side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986a]. In this section the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. As described above, the attribute grammar mixes synthesized
semantic rules and inherited rules. It is also an L-attributed grammar. Figure 4 shows a parse
tree for the file hampic.
Figure 2 ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
         ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]     n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize.value]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
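The CMAP production in Figure 3 derives the number of color entries from the chunk size (three bytes per color). The following is a minimal Java sketch of that one rule, not the authors' implementation. The class and method names are hypothetical, and it assumes the chunk is already in memory as a byte array. It relies on java.nio.ByteBuffer, which reads multi-byte values in big-endian order by default, the byte order used by IFF files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the paper's tooling.
final class CmapChunkSketch {

    /** Parses "CMAP" <cksize:UINT32> <color>[n], with n = cksize / 3. */
    static int[][] parseCmap(byte[] chunk) {
        ByteBuffer in = ByteBuffer.wrap(chunk);        // big-endian by default
        byte[] id = new byte[4];
        in.get(id);
        if (!"CMAP".equals(new String(id, StandardCharsets.ISO_8859_1))) {
            throw new IllegalArgumentException("expected chunk ID CMAP");
        }
        long ckSize = in.getInt() & 0xFFFFFFFFL;       // attribute cksize.value, read as unsigned
        if (ckSize % 3 != 0 || ckSize > in.remaining()) {
            throw new IllegalArgumentException("bad CMAP cksize: " + ckSize);
        }
        int n = (int) (ckSize / 3);                    // repetition count passed to <color>[n]
        int[][] colors = new int[n][3];
        for (int i = 0; i < n; i++) {
            colors[i][0] = in.get() & 0xFF;            // red
            colors[i][1] = in.get() & 0xFF;            // green
            colors[i][2] = in.get() & 0xFF;            // blue
        }
        return colors;
    }

    public static void main(String[] args) {
        // A two-color CMAP chunk: cksize = 6, so n = 2.
        byte[] chunk = {'C', 'M', 'A', 'P', 0, 0, 0, 6,
                        0, (byte) 128, (byte) 128, (byte) 255, 0, (byte) 255};
        int[][] colors = parseCmap(chunk);
        System.out.println(colors.length + " colors; first = "
                + colors[0][0] + " " + colors[0][1] + " " + colors[0][2]);
    }
}
```

The chunk size is checked before it is used as a repetition count, so a malformed size is reported as a format error rather than causing a read past the end of the data.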
Figure 4 A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive nonterminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' ( <data-ck> | <silence-ck> )+
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped, and the procedure
corresponding to the nonterminal at that position is executed.
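One way to picture the pushdown stack of file addresses is the following Java sketch. It is only an illustration under stated assumptions: the "TOY1" layout (a magic number, a 4-byte pointer to a record, and a 2-byte flags field that follows the pointer) is invented for this example and is not a format from this paper, and all class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy layout, invented for illustration:
//   <file>   -> "TOY1" <recordOffset:UINT32> <flags:UINT16>
//   <record> -> <count:UINT16> <data:BYTE>[count]     (located at recordOffset)
final class PointerStackSketch {

    /** Returns {flags, recordLength} for the toy layout above. */
    static int[] parse(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);            // big-endian by default
        Deque<Integer> returnAddresses = new ArrayDeque<>();

        if (in.getInt() != 0x544F5931) {                  // the four bytes "TOY1"
            throw new IllegalArgumentException("bad magic");
        }
        int recordOffset = in.getInt();                   // pointer to a file address
        if (recordOffset < 0 || recordOffset > file.length) {
            throw new IllegalArgumentException("pointer outside file");
        }
        returnAddresses.push(in.position());              // remember where the header continues
        in.position(recordOffset);                        // follow the pointer
        int recordLength = record(in);                    // procedure for the pointed-to nonterminal

        in.position(returnAddresses.pop());               // resume at the saved address
        int flags = in.getShort() & 0xFFFF;               // <flags:UINT16>
        return new int[] {flags, recordLength};
    }

    // <record> -> <count:UINT16> <data:BYTE>[count]
    private static int record(ByteBuffer in) {
        int count = in.getShort() & 0xFFFF;
        if (count > in.remaining()) {
            throw new IllegalArgumentException("record overruns file");
        }
        in.position(in.position() + count);               // skip the record data
        return count;
    }

    static byte[] sample() {
        return new byte[] {'T', 'O', 'Y', '1',            // magic
                           0, 0, 0, 10,                   // recordOffset = 10
                           0, 7,                          // flags = 7
                           0, 3, 1, 2, 3};                // record at offset 10: count = 3
    }

    public static void main(String[] args) {
        int[] result = parse(sample());
        System.out.println("flags=" + result[0] + " recordLength=" + result[1]);
    }
}
```

With nested pointers, each jump would push another address, and the parser would unwind them in last-in, first-out order.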
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
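As a rough illustration of such a recognizer, the sketch below hand-codes a predictive recursive descent recognizer for a simplified ILBM grammar: one procedure per production, no separate lexer, and semantic checks on chunk sizes. It is an assumption-laden sketch rather than the parser described in this paper. The class and method names are hypothetical, the grammar is reduced to "FORM" cksize "ILBM" followed by generic chunks, the 20-byte BMHD size is taken from the field types in Figure 3, and odd-sized chunks are assumed to be followed by one pad byte, as is the IFF convention.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical recognizer for a simplified ILBM grammar; illustration only.
final class IlbmRecognizerSketch {
    private final ByteBuffer in;        // big-endian by default, matching IFF
    private boolean foundBmhd;

    private IlbmRecognizerSketch(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** Returns true if the bytes conform to the simplified grammar. */
    static boolean recognize(byte[] file) {
        try {
            return new IlbmRecognizerSketch(file).ilbm();
        } catch (RuntimeException e) {   // buffer underflow or a failed match
            return false;
        }
    }

    // <ILBM> -> "FORM" <cksize> "ILBM" <chunk>*
    private boolean ilbm() {
        match("FORM");
        long ckSize = uint32();
        if (ckSize != in.remaining()) return false;    // context-sensitive size check
        match("ILBM");
        while (in.hasRemaining()) chunk();
        return foundBmhd;                              // at least one BMHD is required
    }

    // <chunk> -> <id:FOURCC> <cksize> <data:BYTE>[cksize]
    private void chunk() {
        String id = fourCC();
        long ckSize = uint32();
        if (ckSize > in.remaining()) throw new IllegalStateException("chunk overruns file");
        if (id.equals("BMHD")) {
            if (ckSize != 20) throw new IllegalStateException("BMHD must be 20 bytes");
            foundBmhd = true;
        }
        in.position(in.position() + (int) ckSize);     // skip the chunk data
        if ((ckSize & 1) == 1 && in.hasRemaining()) in.get();   // IFF pad byte
    }

    private void match(String expected) {
        if (!fourCC().equals(expected)) throw new IllegalStateException("expected " + expected);
    }

    private String fourCC() {
        byte[] b = new byte[4];
        in.get(b);
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    private long uint32() {
        return in.getInt() & 0xFFFFFFFFL;
    }

    /** Builds a tiny well-formed file: FORM, ILBM, an empty BMHD, and a 2-byte BODY. */
    static byte[] sample() {
        ByteBuffer b = ByteBuffer.allocate(50);
        b.put(ascii("FORM")).putInt(42).put(ascii("ILBM"));
        b.put(ascii("BMHD")).putInt(20).put(new byte[20]);
        b.put(ascii("BODY")).putInt(2).put(new byte[2]);
        return b.array();
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("Successfully parsed = " + recognize(sample()));
    }
}
```

Because each chunk begins with a four-character ID, the next production can be chosen by looking only at that ID, which is why no backtracking is needed in this sketch.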
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-based
and directory-based binary file formats. This is accomplished by writing a lexer that treats
each input byte in a file as a character token. Then functions are created for each data type in the
binary file grammar that convert the appropriate number of character tokens to binary data types,
e.g., int16, int32, etc. This approach is effective, if somewhat clumsy, and has allowed us to test
our attribute grammars for binary file formats, as well as to demonstrate the feasibility of
creating a parser generator for binary file grammars.
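The conversion from byte-valued tokens to binary data types can be sketched in plain Java as follows. This is a minimal sketch, not the functions used in the grammars of this paper: the class and method names are hypothetical, each argument is assumed to be an unsigned byte value in the range 0 to 255, and a boolean flag selects the byte order (big-endian for IFF files, little-endian for RIFF files).

```java
// Hypothetical conversion helpers; illustration only.
final class ByteDecodeSketch {

    /** Combines two unsigned byte values into an unsigned 16-bit value. */
    static int uint16(int b1, int b2, boolean littleEndian) {
        return littleEndian ? (b2 << 8) | b1 : (b1 << 8) | b2;
    }

    /** Combines two unsigned byte values into a signed 16-bit value. */
    static short int16(int b1, int b2, boolean littleEndian) {
        return (short) uint16(b1, b2, littleEndian);   // the cast reinterprets the sign bit
    }

    /** Combines four unsigned byte values into an unsigned 32-bit value held in a long. */
    static long uint32(int b1, int b2, int b3, int b4, boolean littleEndian) {
        // The most significant byte is widened to long before shifting so that
        // values of 2^31 and above do not become negative.
        return littleEndian
                ? ((long) b4 << 24) | (b3 << 16) | (b2 << 8) | b1
                : ((long) b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    public static void main(String[] args) {
        // 0x000009FC read as a big-endian chunk size.
        System.out.println(uint32(0x00, 0x00, 0x09, 0xFC, false));
        // The same four bytes in little-endian file order.
        System.out.println(uint32(0xFC, 0x09, 0x00, 0x00, true));
        // 0xFFFE read as a signed and as an unsigned 16-bit value.
        System.out.println(int16(0xFF, 0xFE, false) + " " + uint16(0xFF, 0xFE, false));
    }
}
```

Widening to long before the 24-bit shift matters because shifting an int left by 24 bits can set the sign bit, which would turn a large unsigned chunk size into a negative number.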
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
  language = Java;              // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }
}

// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
  : FORM ckSize ILBM
    {
      // Print out ILBM chunk size
      System.out.println("ID ILBM Size " + $ckSize.value);
    }
    // Semantic predicate that ensures that at least one BMHD is found
    propertyChunks { $propertyChunks.foundBMHD == true }?
    dataChunks
    body
    { $result = true; }
  ;

// The bmhd, cmap and camg property chunks can occur in any order
// At least one BMHD data chunk must be present
propertyChunks returns [boolean foundBMHD]
  : ( bmhd { $foundBMHD = true; }
    | cmap
    | camg )+
  ;

bmhd
  : BMHD ckSize
    { System.out.println("ID BMHD Size " + $ckSize.value); }
    bitMapHeader
  ;

bitMapHeader
  : width = uWordValue
    height = uWordValue
    xPosition = wordValue
    yPosition = wordValue
    nPanes = uByteValue
    masking = uByteValue
    compression = uByteValue
    reserved1 = uByteValue
    transparentColor = uWordValue
    xAspect = uByteValue
    yAspect = uByteValue
    pageWidth = wordValue
    pageHeight = wordValue
    {
      // Print out values in the bitmap structure
      System.out.println("  Width " + $width.value);
      System.out.println("  Height " + $height.value);
      System.out.println("  X Position " + $xPosition.value);
      System.out.println("  Y Position " + $yPosition.value);
      System.out.println("  Number of Panes " + $nPanes.value);
      System.out.println("  Masking " + $masking.value);
      System.out.println("  Compression " + $compression.value);
      System.out.println("  Reserved1 " + $reserved1.value);
      System.out.println("  TransParent Color " +
          $transparentColor.value);
      System.out.println("  X Aspest " + $xAspect.value);
      System.out.println("  Y Aspect " + $yAspect.value);
      System.out.println("  Page Width " + $pageWidth.value);
      System.out.println("  Page Height " + $pageHeight.value);
    }
  ;

cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
  : CMAP ckSize
    {
      // Print out chunk size
      $value = $ckSize.value;
      System.out.println("ID CMAP Size " + $ckSize.value);
    }
    ( colorRegister
      {
        // Print out color values
        System.out.println("  Color[" + i++ + "] " +
            $colorRegister.redValue +
            " " + $colorRegister.greenValue +
            " " + $colorRegister.blueValue);
      }
    )+
  ;

colorRegister returns [int redValue, int greenValue, int blueValue]
  : red = uByteValue green = uByteValue blue = uByteValue
    {
      $redValue = $red.value;
      $greenValue = $green.value;
      $blueValue = $blue.value;
    }
  ;

camg
  : CAMG ckSize longValue
    {
      System.out.println("ID CAMG Size " + $ckSize.value);
      System.out.println("  View Modes 0x" +
          Long.toHexString($longValue.value));
    }
  ;

// The crng and ccrt data chunks
dataChunks
  : crng
  | ccrt
  ;

crng
  : CRNG ckSize
    { System.out.println("ID CRNG Size " + $ckSize.value); }
    cRange
  ;

cRange
  : pad1 = wordValue { $pad1.value == 0 }?
    rate = wordValue
    active = wordValue
    low = uByteValue
    high = uByteValue
  ;

ccrt
  : CCRT ckSize
    { System.out.println("ID CCRT Size " + $ckSize.value); }
    cycleInfo
  ;

cycleInfo
  : direction = wordValue
    start = uByteValue
    end = uByteValue
    seconds = longValue
    microseconds = longValue
    pad = wordValue { $pad.value == 0 }?
  ;

body
  : BODY ckSize
    { System.out.println("ID BODY Size " + $ckSize.value); }
    { $ckSize.value > 0 }? data
    { System.out.println("  Number of bytes " + $data.value); }
    { $data.value == $ckSize.value }?
  ;

// defines the data chunk of the body chunk
data returns [long value]
  : (b += BYTE)+ { $b.size() > 0 }?
    { $value = $b.size(); }
  ;

// The cksize is the value of a long integer
ckSize returns [long value]
  : longValue
    { $value = $longValue.value; }
  ;

// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
  : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
    {
      $value = (long)
          ( getUByteValue($h1) << 24 |
            getUByteValue($h2) << 16 |
            getUByteValue($l1) << 8 |
            getUByteValue($l2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the unsigned word data type
uWordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (int)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (short)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the unsigned byte data type
uByteValue returns [int value]
  : BYTE
    {
      // calls getUByteValue on a BYTE token
      $value = (int) getUByteValue($BYTE);
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided with the grammar described in the previous section, ANTLR generates the
iblmLexer and iblmParser Java programs. The Java application shown in Figure 7 uses these
programs to validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;

import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that each byte becomes one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
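The test application reads the file with the ISO-8859-1 character encoding. The sketch below suggests why that choice suits a lexer that treats each byte as a character token: ISO-8859-1 maps every byte value to exactly one character with the same numeric value, so masking the character with 0x00ff recovers the original byte. The class and method names are hypothetical, and the UTF-8 comparison is included only as a contrast.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class; illustration only.
final class Latin1RoundTripSketch {

    /** True if every byte value survives decoding to a char and masking back. */
    static boolean roundTrips() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String text = new String(all, StandardCharsets.ISO_8859_1);
        if (text.length() != 256) {
            return false;                              // one char per byte expected
        }
        for (int i = 0; i < 256; i++) {
            if ((text.charAt(i) & 0x00ff) != i) {      // same masking as getUByteValue
                return false;
            }
        }
        return true;
    }

    /** Number of chars produced when the same bytes are decoded as UTF-8. */
    static int utf8Length(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 round trip: " + roundTrips());
        // Under UTF-8 the two bytes 0xC3 0xA9 merge into a single character,
        // so the one-token-per-byte assumption would no longer hold.
        System.out.println("UTF-8 chars for 2 bytes: "
                + utf8Length(new byte[] {(byte) 0xC3, (byte) 0xA9}));
    }
}
```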
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
  Width 200
  Height 144
  X Position 0
  Y Position 0
  Number of Panes 6
  Masking 0
  Compression 1
  Reserved1 0
  TransParent Color 0
  X Aspest 3
  Y Aspect 3
  Page Width 200
  Page Height 144
ID CMAP Size 48
  Color[0] 0 0 0
  Color[1] 0 128 128
  Color[2] 0 255 255
  Color[3] 255 0 255
  Color[4] 0 255 0
  Color[5] 255 0 0
  Color[6] 0 0 255
  Color[7] 255 255 0
  Color[8] 128 128 128
  Color[9] 192 192 192
  Color[10] 128 0 0
  Color[11] 0 128 0
  Color[12] 128 128 0
  Color[13] 0 0 128
  Color[14] 128 0 128
  Color[15] 255 255 255
ID CAMG Size 4
  View Modes 0x800
ID BODY Size 2448
  Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
  language = Java;              // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }

  // Default ByteOrder to Little Endian
  String byteOrder = "II";
}

// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
  : ckId = fourCC { $ckId.value.equals("RIFF") }?
    ckSize = dWordValue
    formType = fourCC { $formType.value.equals("WAVE") }?
    formatSubChunk
    dataSubChunk
    { $result = true; }
  ;

fourCC returns [String value]
  : BYTE BYTE BYTE BYTE { $value = $text; }
  ;

formatSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
    ckSize = dWordValue
    commonFields
    pcmSpecificField
  ;

commonFields
  : formatTag = wordValue           // Format category
    channels = wordValue            // Number of channels
    samplesPerSec = dWordValue      // Sampling rate
    avgBytesPerSec = dWordValue     // For buffer estimation
    blockAlign = wordValue          // Data block size
  ;

pcmSpecificField
  : bitsPerSample = wordValue       // Sample size
  ;

dataSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("data") }?
    ckSize = dWordValue
    waveData[$ckSize.value]
  ;

waveData [long size]
  : ( { size >= 0 }?=> BYTE { size--; } )+
  ;

// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
  : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (long)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 |
              getUByteValue($b3) << 16 |
              getUByteValue($b4) << 24 );
      } else {
        $value = (long)
            ( getUByteValue($b1) << 24 |
              getUByteValue($b2) << 16 |
              getUByteValue($b3) << 8 |
              getUByteValue($b4) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (int)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 );
      } else {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Terminal rule used to create the lexer
BYTE : . ;
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
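For comparison with the grammar in Figure 9, the same sequence of fields can be read directly with java.nio.ByteBuffer set to little-endian order. This is a hedged sketch, not code generated by ANTLR or used in this research. The class and method names are hypothetical, it accepts only the layout covered by Figure 9 (a fmt chunk followed immediately by a data chunk), and it assumes a 16-byte PCM fmt chunk, which is the combined size of the common fields and the bits-per-sample field.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for the RIFF/WAVE layout of Figure 9; illustration only.
final class WaveHeaderSketch {

    /** Returns {channels, samplesPerSec, bitsPerSample, dataSize}; throws on a mismatch. */
    static long[] parse(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        expect(in, "RIFF");
        long riffSize = uint32(in);
        if (riffSize != in.remaining()) throw new IllegalArgumentException("bad RIFF cksize");
        expect(in, "WAVE");

        expect(in, "fmt ");
        if (uint32(in) != 16) throw new IllegalArgumentException("expected 16-byte PCM fmt chunk");
        int formatTag = in.getShort() & 0xFFFF;        // wFormatTag, 0x0001 for PCM
        if (formatTag != 0x0001) throw new IllegalArgumentException("not PCM");
        int channels = in.getShort() & 0xFFFF;         // wChannels
        long samplesPerSec = uint32(in);               // dwSamplesPerSec
        uint32(in);                                    // dwAvgBytesPerSec (not checked here)
        in.getShort();                                 // wBlockAlign (not checked here)
        int bitsPerSample = in.getShort() & 0xFFFF;    // wBitsPerSample

        expect(in, "data");
        long dataSize = uint32(in);
        if (dataSize > in.remaining()) throw new IllegalArgumentException("data overruns file");
        return new long[] {channels, samplesPerSec, bitsPerSample, dataSize};
    }

    private static void expect(ByteBuffer in, String id) {
        byte[] b = new byte[4];
        in.get(b);
        if (!id.equals(new String(b, StandardCharsets.ISO_8859_1))) {
            throw new IllegalArgumentException("expected chunk ID '" + id + "'");
        }
    }

    private static long uint32(ByteBuffer in) {
        return in.getInt() & 0xFFFFFFFFL;
    }

    /** Builds a tiny mono, 8000 Hz, 16-bit file with 4 bytes of sample data. */
    static byte[] sample() {
        ByteBuffer b = ByteBuffer.allocate(48).order(ByteOrder.LITTLE_ENDIAN);
        b.put(ascii("RIFF")).putInt(40).put(ascii("WAVE"));
        b.put(ascii("fmt ")).putInt(16)
         .putShort((short) 1).putShort((short) 1)      // PCM, one channel
         .putInt(8000).putInt(16000)                   // sample rate, average bytes per second
         .putShort((short) 2).putShort((short) 16);    // block align, bits per sample
        b.put(ascii("data")).putInt(4).put(new byte[4]);
        return b.array();
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        long[] h = parse(sample());
        System.out.println("channels=" + h[0] + " rate=" + h[1]
                + " bits=" + h[2] + " dataSize=" + h[3]);
    }
}
```

Setting the byte order once on the buffer plays the same role as the byteOrder member in the grammar, without a test in every conversion.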
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison SMUS IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOffice.org's Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward and S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
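The sample-count relationship above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and are not part of the 8SVX specification; the arguments correspond to the VHDR fields ctOctave, oneShotHiSamples and repeatHiSamples.

```python
def octave_sample_counts(ct_octave, one_shot_hi_samples, repeat_hi_samples):
    """Return the number of samples in each octave, highest-frequency octave first."""
    base = one_shot_hi_samples + repeat_hi_samples
    # Octave k (k = 0 is the highest frequency) holds 2^k times the base count.
    return [(2 ** k) * base for k in range(ct_octave)]


def body_sample_count(ct_octave, one_shot_hi_samples, repeat_hi_samples):
    """Total samples in the BODY chunk: (2^0 + ... + 2^(ctOctave-1)) * base."""
    return sum(octave_sample_counts(ct_octave, one_shot_hi_samples, repeat_hi_samples))
```

A validator could compare this total with the cksize of the BODY chunk, assuming uncompressed one-byte samples.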
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
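As a hedged illustration of the first two productions, the following Python sketch checks a header chunk and returns the three 16-bit header fields. The function name and the returned dictionary are assumptions made for this example; the big-endian byte order follows the Standard MIDI File convention.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Validate an MThd chunk and return its format, ntrks and division fields."""
    if len(data) < 14:
        raise ValueError("too short to contain a header chunk")
    # "MThd" tag followed by a 32-bit big-endian length of header data.
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data must be 6")
    # Header data: three 16-bit big-endian integers.
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}
```

The constant check on the length field is an instance of the context-sensitive detail that the plain pseudo-EBNF leaves to the accompanying prose.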
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
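The segment layout above can be walked with a short routine. The following Python sketch is an illustration under simplifying assumptions, not a complete JPEG validator: the function name is hypothetical, and the walk stops at the start-of-scan marker ($ff $da) because the entropy-coded image data that follows it is not organized as length-prefixed segments.

```python
import struct


def jpeg_segments(data: bytes) -> list:
    """Return (type, size) pairs for the segments that follow the SOI marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    segments = []
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        marker = data[pos + 1]
        if marker in (0xD9, 0xDA):  # EOI, or SOS where compressed data begins
            break
        # Size is big-endian (high byte first) and counts its own two bytes.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2:
            raise ValueError("segment size must be at least 2")
        segments.append((marker, size))
        pos += 2 + size  # skip $ff, the type byte, and the sized portion
    return segments
```

Each segment is thus handled much like an IFF chunk, with the type byte in place of the four-character ID.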
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
A.18.9 Electronic Arts TQI
A.19 IMAGINE 3D DATA DESCRIPTION (TDDD)
A.20 LIGHTWAVE 3D OBJECTS
A.20.1 LWOB
A.20.2 LWO2
A.20.3 LWLO
A.21 LUXOLOGY MODO 3D OBJECT (LXOB)
A.22 PAINT SHOP PRO (PSP)
A.23 PROVECTOR 2D DRAWING
A.24 PROWRITE DOCUMENT
A.25 INTERACTIVE FICTION
A.25.1 Blorb Z-machine Resource File
A.25.2 Quetzal
A.26 APPLE QUICKTIME
A.27 MAYA IFF BITMAP
A.28 APPLE CORE AUDIO FORMAT
A.29 AMERICAN LASER GAMES (MM)
A.30 SMUSH ANIMATION FORMAT
A.30.1 SAN
A.30.2 SNM
A.31 STRUCTURED DATA EXCHANGE FORMAT (SDXF)
A.32 OGG VORBIS
A.33 SIMPLE VECTOR ARRAY FORMAT
A.34 NESTED LIST FILE
A.35 CINEMA 4D
A.36 THE SIMS INTERCHANGE FILE FORMAT
List of Figures
Figure 1 File format (Layout) of a RIFF container for a WebP Image
Figure 2 ILBM File hampic Displayed
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4 A Parse Tree for the ILBM File hampic
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
Figure 6 An ANTLR Grammar for the ILBM File Format
Figure 7 A Java Application for Parsing an ILBM File
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
1 Introduction
1.1 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives. Automated tools are also needed for
viewing or playing text and binary file formats and for converting legacy and obsolete file formats
to standard or current formats. Technologies such as data description languages have emerged to
address these preservation challenges [Dunckley et al 2007].
Context-free grammars have been used to specify the syntax of programming languages.
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar and for addressing the mildly context-sensitive features of
programming languages, such as agreement of the data types of variables in expressions with
data type declarations. Such grammars have parsing and translation algorithms that can be used
for syntax checking and for interpretation or translation of the languages the grammars define.
1.2 Purpose
This research asks whether the context-free grammars used to specify the syntax of programming
languages can be extended to the specification of binary file formats, and whether these
grammars can be used with parsers to validate the formats of binary files.
1.3 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats. In section 3, extensions to the concepts of context-free
grammars and attribute grammars are described that enable the specification of binary file
formats. In section 4, examples of attribute array grammars for chunk-based binary file formats
are presented. In section 5, recursive descent parsers for these classes of grammars are described.
In section 6, experience in using ANTLR, a parser generator for LL(*) string grammars, in
generating parsers for recognizing the formats of binary files is discussed. In section 7, related
research is described. Section 8 summarizes the results.
2 Binary File Formats
Traditionally, a file (or record) layout was a form showing how fields were positioned in a file
(or record). The fields are named and have as attributes a data type, a length, and sometimes a
constant value. Fields are offset at addresses relative to the beginning of a file. File formats are
still specified in this manner. For instance, Figure 1 shows the format layout of a RIFF container
for the VP8 codec of a WebP image [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1 File format (Layout) of a RIFF container for a WebP Image
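A layout of this kind translates directly into a fixed-offset check. The Python sketch below is a minimal illustration, not a full WebP validator: the function name and the returned field names are assumptions made for the example, and the size fields are read as 32-bit little-endian integers, as is conventional for RIFF.

```python
import struct


def parse_webp_header(data: bytes) -> dict:
    """Check the tags at offsets 0, 8 and 12 and return the two size fields."""
    if len(data) < 20:
        raise ValueError("too short to contain the 20-byte layout")
    # Offsets 0-11: "RIFF", size of data from offset 8, "WEBP".
    riff_tag, riff_size, form_type = struct.unpack("<4sI4s", data[:12])
    # Offsets 12-19: "VP8 " tag, size of image data from offset 20.
    vp8_tag, vp8_size = struct.unpack("<4sI", data[12:20])
    if riff_tag != b"RIFF" or form_type != b"WEBP" or vp8_tag != b"VP8 ":
        raise ValueError("unexpected tag in header")
    return {"riff_size": riff_size, "vp8_size": vp8_size, "image_offset": 20}
```

What the fixed layout cannot express is the relationship between the two size fields and the actual amount of data present, which is the kind of constraint taken up in the discussion that follows.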
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human-readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not expressible in the EBNF (or regular
expression) notation. These details are often context-sensitive relationships: that of the declared
size of an array to its actual length, or that of an address pointer to the actual location of the
data pointed to in the file. These context-sensitive relationships are not expressible in a
context-free grammar (or EBNF notation). It is a goal of this research to extend the EBNF notation
and the concept of a context-free grammar to include the details that are necessary to specify
precisely binary file formats that include these context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats. Two
families of binary file formats are readily distinguished: the chunk-based and the
directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. An IFF file itself is one entire IFF chunk. A
chunk consists of an ID tag, a size, and size bytes of data. The data may include chunks, called
subchunks. Subchunks have the same structure as chunks, and subchunks can themselves have
subchunks. Chunk-based file formats are primarily used as containers for multimedia but have
also been used for word processing files that include pictures and other figures. File format
specifications for chunk-based formats may use terms other than chunks in describing the format,
for instance "atoms" in QuickTime/MP4, "segments" in JPEG, and "tagged data" representations
in AutoCAD DXF.
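The recursive chunk structure just described can be sketched as a short recursive walk. The following Python code is an illustration only; the function name and the set of container IDs are assumptions for the example. It relies on two EA IFF 85 conventions: sizes are 32-bit big-endian integers, and a chunk with an odd size is followed by one pad byte.

```python
import struct

# Chunk IDs whose data begins with a 4-byte type followed by subchunks
# (an assumption limited to the common IFF group chunks).
CONTAINER_IDS = {b"FORM", b"LIST", b"CAT "}


def walk_chunks(data, start=0, end=None, depth=0):
    """Return (depth, id, size) for every chunk found between start and end."""
    end = len(data) if end is None else end
    pos = start
    found = []
    while pos + 8 <= end:
        ck_id, ck_size = struct.unpack(">4sI", data[pos:pos + 8])
        body = pos + 8
        if body + ck_size > end:
            raise ValueError("chunk %r overruns its parent" % ck_id)
        found.append((depth, ck_id, ck_size))
        if ck_id in CONTAINER_IDS:
            # Skip the 4-byte type, then recurse over the subchunks.
            found.extend(walk_chunks(data, body + 4, body + ck_size, depth + 1))
        pos = body + ck_size + (ck_size & 1)  # odd-sized data is padded
    return found
```

The overrun check compares each declared size with the space actually available in the parent, a constraint that a context-free rule alone does not capture.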
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format consists of
one or more directory tables, which contain one or more directory entries. A directory entry
specifies where the actual data for a type of information is located. This is the scheme used in
TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument, and
Microsoft Open Office files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar G is a quadruple $\langle N, \Sigma, S, P \rangle$ where
N is a finite set of non-terminal symbols
$\Sigma$ is a finite set of terminal symbols
$S \in N$ is the start symbol
P is a set of production rules of the form $N \to (N \cup \Sigma)^*$
The set of all strings over a vocabulary $\Sigma$ is denoted by $\Sigma^*$. The language generated by a
context-free grammar G is the set $L(G) = \{ w \mid w \in \Sigma^* \text{ and } S \Rightarrow^* w \}$.
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i has the
address $4 \times i$.
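The address computation in this example can be written out as follows. This is a small Python sketch with hypothetical names; the little-endian byte order is an assumption of the example, since the text does not specify one.

```python
import struct

WORD_SIZE = 4  # bytes per integer element, as in the example above


def element_address(i):
    """File address of the element with index i."""
    return WORD_SIZE * i


def read_element(buf, i):
    """Read the signed 32-bit integer stored at index i (little-endian assumed)."""
    (value,) = struct.unpack_from("<i", buf, element_address(i))
    return value
```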
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) are one-dimensional arrays of data types whose indices are their addresses.
A data type is a classification of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, and ptr. We define a binary file to be an
array of values of data of various types.
We define a context-free binary file grammar BG as a quintuple $\langle N, D, \Sigma, S, P \rangle$ where
N is a finite set of non-terminal symbols
D is a set of data types
$\Sigma$ is a finite set of binary values of data types D, called terminals
$S \in N$ is the start symbol
P is a set of production rules of the form $N \to (N \cup \Sigma)^*$
Let DataTypes indicate the union of all the values of all data types D. The set of all binary files is
$\mathit{BinaryFiles} = [\mathit{Indices} \to \mathit{DataTypes}]$, where $\mathit{Indices} = \{1, \ldots\}$. The language generated by a
context-free array (binary file) grammar BG is the set
$L(BG) = \{ w \mid w \in \mathit{BinaryFiles} \text{ and } S \Rightarrow^* w \}$.
By primitive nonterminal is meant a symbol that would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions that have only
terminal symbols on the right. We adopt the notation shown below for primitive nonterminals:
<primitive_nonterminal_symbol type=datatype_name constant=value>
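One way to read this notation operationally is as a typed read with an optional constant check. The Python sketch below illustrates that reading; the function name, the type table, and the big-endian byte order are assumptions of the example rather than part of the notation.

```python
import struct

# Hypothetical mapping from data type names to struct formats (big-endian).
TYPE_FORMATS = {
    "byte": ">B",
    "int16": ">h",
    "uint16": ">H",
    "int32": ">i",
    "uint32": ">I",
}


def read_primitive(data, pos, type_name, constant=None):
    """Read one value of the given type at pos; return (value, next position)."""
    fmt = TYPE_FORMATS[type_name]
    (value,) = struct.unpack_from(fmt, data, pos)
    if constant is not None and value != constant:
        raise ValueError("expected constant %r, found %r" % (constant, value))
    return value, pos + struct.calcsize(fmt)
```

For instance, a field written as `<length type=uint32 constant=6>` would correspond to `read_primitive(data, pos, "uint32", constant=6)`.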
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1969, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar GA is a triple $\langle G, A, AR \rangle$, where G is a context-free grammar for the
language, A associates each grammar symbol $X \in (N \cup \Sigma)$ with a set of attributes, and AR
associates each production $R \in P$ with a set of attribute computation rules. $A(X)$, where
$X \in (N \cup \Sigma)$, can be further partitioned into two sets: synthesized attributes $S(X)$ and
inherited attributes $I(X)$. $AR(R)$, where $R \in P$, contains rules for computing inherited and
synthesized attributes associated with the symbols in the production R. Synthesized attributes
are evaluated and passed up from deeper terminals and non-terminals, while inherited attributes
are passed down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes and
ii) inherited attributes, where each inherited attribute depends only on
a) attributes of the sibling nodes to its left
b) inherited attributes of the parent node
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
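To make the left-to-right evaluation concrete, the following Python sketch parses an IFF colour map chunk of the form "CMAP" cksize color[n], with n = cksize / 3, in recursive-descent style. It is an illustrative sketch with hypothetical function names, not the parser described in this report: the count n is computed from the size field to its left and passed down as an inherited attribute, and the list of colours is passed back up as a synthesized attribute.

```python
import struct


def parse_colors(data, pos, n):
    """Parse n three-byte colours; n is an inherited attribute from the caller."""
    if pos + 3 * n > len(data):
        raise ValueError("colour data is shorter than cksize declares")
    colors = [tuple(data[pos + 3 * k:pos + 3 * k + 3]) for k in range(n)]
    return colors, pos + 3 * n  # synthesized: the colours and the next position


def parse_cmap(data, pos=0):
    """Parse a CMAP chunk; return ({'cksize', 'colors'}, next position)."""
    ck_id, ck_size = struct.unpack_from(">4sI", data, pos)
    if ck_id != b"CMAP":
        raise ValueError("expected CMAP chunk")
    if ck_size % 3 != 0:
        raise ValueError("cksize is not a multiple of 3")
    # The attribute n = cksize / 3 depends only on a symbol to the left.
    colors, end = parse_colors(data, pos + 8, ck_size // 3)
    return {"cksize": ck_size, "colors": colors}, end
```

Because each inherited value depends only on symbols already read, the attributes are available by the time the corresponding parsing function is called.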
3.3 Semantic Grammar
An interpreter for a binary file format that uses a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialogue systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993, Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
SHOW     FLIGHTS  ORIGIN       DESTINATION       DEPART_DATE  DEPART_TIME
Show me  flights  from Boston  to San Francisco  on Tuesday   morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present is a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986]. In this section, the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. This attribute grammar uses both synthesized and inherited
semantic rules, as defined at the beginning of this section, and it is also an L-attributed grammar.
Figure 4 shows a parse tree for the file hampic.
Figure 2. ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
         ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]
         n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
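The productions of this grammar map directly onto sequential reads. The following is a minimal Java sketch, not the parser used in this work, that reads a BMHD chunk field by field in big-endian byte order and derives the CMAP color count n from the CMAP chunk size. It reads width as a 16-bit value, which is what a 20-byte BMHD chunk implies. The class BmhdSketch and its methods are hypothetical names, and the sample bytes are built from the values that Figure 8 reports for the file BLKIFF.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of the <BMHD> and <BitMapHeader> productions (hypothetical class name). */
final class BmhdSketch {
    final int width, height, nPlanes, compression;

    private BmhdSketch(int width, int height, int nPlanes, int compression) {
        this.width = width;
        this.height = height;
        this.nPlanes = nPlanes;
        this.compression = compression;
    }

    /** <BMHD> -> "BMHD" <cksize> <BitMapHeader>; ByteBuffer is big-endian by default. */
    static BmhdSketch parse(ByteBuffer in) {
        byte[] id = new byte[4];
        in.get(id);
        if (!"BMHD".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("expected BMHD chunk ID");
        }
        long ckSize = in.getInt() & 0xffffffffL;   // UINT32
        if (ckSize != 20) {                        // sum of the field widths read below
            throw new IllegalArgumentException("unexpected BMHD size " + ckSize);
        }
        int width = in.getShort() & 0xffff;        // UINT16
        int height = in.getShort() & 0xffff;       // UINT16
        in.getShort();                             // xposition, INT16
        in.getShort();                             // yposition, INT16
        int nPlanes = in.get() & 0xff;             // BYTE
        in.get();                                  // masking
        int compression = in.get() & 0xff;         // BYTE
        in.get();                                  // reserved
        in.getShort();                             // transparentcolor, UINT16
        in.get();                                  // xaspect
        in.get();                                  // yaspect
        in.getShort();                             // pagewidth, INT16
        in.getShort();                             // pageheight, INT16
        return new BmhdSketch(width, height, nPlanes, compression);
    }

    /** Attribute rule of <CMAP>: n = cksize / 3 (three bytes per color). */
    static long colorCount(long cmapCkSize) {
        return cmapCkSize / 3;
    }

    /** Builds a BMHD chunk from the values printed in Figure 8. */
    static ByteBuffer sample() {
        ByteBuffer b = ByteBuffer.allocate(28);
        b.put("BMHD".getBytes(StandardCharsets.US_ASCII)).putInt(20);
        b.putShort((short) 200).putShort((short) 144);   // width, height
        b.putShort((short) 0).putShort((short) 0);       // x and y position
        b.put((byte) 6).put((byte) 0).put((byte) 1).put((byte) 0);
        b.putShort((short) 0);                           // transparent color
        b.put((byte) 3).put((byte) 3);                   // x and y aspect
        b.putShort((short) 200).putShort((short) 144);   // page width, page height
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        BmhdSketch h = parse(sample());
        System.out.println(h.width + " x " + h.height + ", planes = " + h.nPlanes);
        System.out.println("colors = " + colorCount(48));
    }
}
```

The chunk size check is a simple instance of a context-sensitive constraint: the value read for cksize must agree with the number of bytes that the remaining symbols of the production consume.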
Figure 4. A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' { <data-ck> | <silence-ck> }...
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing the production rules that correspond to
the pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
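One way to read the pointer-handling scheme is sketched below in Java. This is an illustration under stated assumptions rather than the implementation used in this work: the two-production grammar in the comment, the class PointerFollowingSketch, and the sample file layout are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical grammar with one file pointer:
 *   <file>   -> <offset: UINT32> <tag: BYTE>
 *   <record> -> <value: UINT16>          (located at the address 'offset')
 */
final class PointerFollowingSketch {

    /** Parses <file> and returns {value, tag}. */
    static int[] parseFile(ByteBuffer in) {
        Deque<Integer> addresses = new ArrayDeque<>();   // the pushdown stack
        int offset = in.getInt();                        // pointer to the record
        addresses.push(in.position());                   // remember where to resume
        in.position(offset);                             // jump to the pointed-to location
        int value = parseRecord(in);                     // procedures for that location
        in.position(addresses.pop());                    // pop and resume after the pointer
        int tag = in.get() & 0xff;
        return new int[] {value, tag};
    }

    /** <record> -> <value: UINT16> */
    static int parseRecord(ByteBuffer in) {
        return in.getShort() & 0xffff;
    }

    /** A sample 8-byte file: pointer, tag, one filler byte, then the record at offset 6. */
    static ByteBuffer sample() {
        ByteBuffer b = ByteBuffer.allocate(8);
        b.putInt(6);
        b.put((byte) 7);
        b.put((byte) 0);
        b.putShort((short) 0x1234);
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        int[] r = parseFile(sample());
        System.out.println("value = " + r[0] + ", tag = " + r[1]);
    }
}
```

A stack rather than a single saved address is used so that the pointed-to structure may itself contain pointers.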
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
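The following Java fragment is a minimal sketch of such a recognizer for a generic FORM of chunks, with one procedure per production and no backtracking. It is not the parser generated in this work. The class FormRecognizer and its methods are hypothetical names, the requirement that the FORM size match the rest of the file exactly is a simplifying assumption, and the pad byte after odd-sized chunks is an assumption taken from the IFF convention.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of a predictive recursive descent recognizer for a FORM of generic chunks. */
final class FormRecognizer {

    /** Returns true if the bytes conform to the grammar, false otherwise. */
    static boolean recognize(byte[] file) {
        try {
            form(ByteBuffer.wrap(file));   // big-endian by default
            return true;
        } catch (BufferUnderflowException | IllegalArgumentException e) {
            return false;                  // a syntax error was detected
        }
    }

    /** <form> -> "FORM" <cksize: UINT32> <formType: ID> <chunk>* */
    static void form(ByteBuffer in) {
        if (!"FORM".equals(id(in))) {
            throw new IllegalArgumentException("expected FORM");
        }
        long ckSize = uint32(in);
        if (ckSize != in.remaining()) {    // context-sensitive check on the chunk size
            throw new IllegalArgumentException("FORM size mismatch");
        }
        id(in);                            // form type, for example ILBM
        while (in.hasRemaining()) {
            chunk(in);
        }
    }

    /** <chunk> -> <ckID: ID> <cksize: UINT32> <data: BYTE>[cksize] [pad] */
    static void chunk(ByteBuffer in) {
        id(in);
        long ckSize = uint32(in);
        long padded = ckSize + (ckSize & 1);   // assumed IFF pad byte after odd sizes
        if (padded > in.remaining()) {
            throw new IllegalArgumentException("chunk overruns FORM");
        }
        in.position(in.position() + (int) padded);
    }

    static String id(ByteBuffer in) {
        byte[] b = new byte[4];
        in.get(b);
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    static long uint32(ByteBuffer in) {
        return in.getInt() & 0xffffffffL;
    }

    /** A 24-byte sample: FORM, size 16, ILBM, and one CAMG chunk of 4 bytes. */
    static byte[] sample() {
        ByteBuffer b = ByteBuffer.allocate(24);
        b.put("FORM".getBytes(StandardCharsets.ISO_8859_1)).putInt(16);
        b.put("ILBM".getBytes(StandardCharsets.ISO_8859_1));
        b.put("CAMG".getBytes(StandardCharsets.ISO_8859_1)).putInt(4).putInt(0x800);
        return b.array();
    }

    public static void main(String[] args) {
        System.out.println(recognize(sample()));
    }
}
```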
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
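The byte-as-character technique can be sketched in plain Java, independent of ANTLR. The fragment below assumes the file is decoded with the ISO-8859-1 charset, which maps each byte value 0 to 255 onto the character with the same code, and then rebuilds big-endian integers from the characters. The class ByteTokenSketch and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: recover binary data types from a file decoded as one character per byte. */
final class ByteTokenSketch {

    /** One character token back to an unsigned byte value in the range 0..255. */
    static int uByte(String text, int index) {
        return text.charAt(index) & 0xff;
    }

    /** Two character tokens to an unsigned 16-bit value, high byte first. */
    static int uint16BE(String text, int index) {
        return (uByte(text, index) << 8) | uByte(text, index + 1);
    }

    /** Two character tokens to a signed 16-bit value, high byte first. */
    static int int16BE(String text, int index) {
        return (short) uint16BE(text, index);
    }

    /** Four character tokens to an unsigned 32-bit value, high byte first. */
    static long uint32BE(String text, int index) {
        return ((long) uByte(text, index) << 24)   // widen first so the result stays unsigned
             | (uByte(text, index + 1) << 16)
             | (uByte(text, index + 2) << 8)
             | uByte(text, index + 3);
    }

    public static void main(String[] args) {
        byte[] raw = {0, 0, 0x09, (byte) 0xFC, (byte) 0xFF, (byte) 0xFE};
        String tokens = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(uint32BE(tokens, 0));
        System.out.println(uint16BE(tokens, 4) + " " + int16BE(tokens, 4));
    }
}
```

A multi-byte charset would not work here, because it could merge or replace bytes during decoding; a single-byte charset that preserves all 256 values is what makes the round trip lossless.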
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and a parser for ILBM files.
grammar iblm;

options {
    language = Java;                 // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}

// Grammar rules used to create the parser

// Start symbol: iblm
iblm returns [boolean result]
    :   FORM ckSize ILBM
        // Print out ILBM chunk size
        { System.out.println("ID ILBM Size " + $ckSize.value); }
        // Semantic predicate that ensures that at least one BMHD is found
        propertyChunks { $propertyChunks.foundBMHD == true }?
        dataChunks
        body
        { $result = true; }
    ;

// The bmhd, cmap, and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    :   ( bmhd { $foundBMHD = true; }
        | cmap
        | camg
        )+
    ;

bmhd
    :   BMHD ckSize
        { System.out.println("ID BMHD Size " + $ckSize.value); }
        bitMapHeader
    ;

bitMapHeader
    :   width = uWordValue
        height = uWordValue
        xPosition = wordValue
        yPosition = wordValue
        nPanes = uByteValue
        masking = uByteValue
        compression = uByteValue
        reserved1 = uByteValue
        transparentColor = uWordValue
        xAspect = uByteValue
        yAspect = uByteValue
        pageWidth = wordValue
        pageHeight = wordValue
        // Print out values in the bitmap structure
        {
            System.out.println(" Width " + $width.value);
            System.out.println(" Height " + $height.value);
            System.out.println(" X Position " + $xPosition.value);
            System.out.println(" Y Position " + $yPosition.value);
            System.out.println(" Number of Panes " + $nPanes.value);
            System.out.println(" Masking" + $masking.value);
            System.out.println(" Compression" + $compression.value);
            System.out.println(" Reserved1" + $reserved1.value);
            System.out.println(" TransParent Color" +
                $transparentColor.value);
            System.out.println(" X Aspest" + $xAspect.value);
            System.out.println(" Y Aspect" + $yAspect.value);
            System.out.println(" Page Width" + $pageWidth.value);
            System.out.println(" Page Height" + $pageHeight.value);
        }
    ;

// The index value i is initialized in the @init section of the cmap rule
cmap returns [long value]
@init { int i = 0; }
    :   CMAP ckSize
        // Print out chunk size
        {
            $value = $ckSize.value;
            System.out.println("ID CMAP Size " + $ckSize.value);
        }
        ( colorRegister
          // Print out color values
          {
              System.out.println(" Color[" + i++ + "] " +
                  $colorRegister.redValue +
                  " " + $colorRegister.greenValue +
                  " " + $colorRegister.blueValue);
          }
        )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    :   red = uByteValue green = uByteValue blue = uByteValue
        {
            $redValue = $red.value;
            $greenValue = $green.value;
            $blueValue = $blue.value;
        }
    ;

camg
    :   CAMG ckSize longValue
        {
            System.out.println("ID CAMG Size " + $ckSize.value);
            System.out.println(" View Modes 0x" +
                Long.toHexString($longValue.value));
        }
    ;

// The crng and ccrt data chunks
dataChunks
    :   crng*
    |   ccrt*
    ;

crng
    :   CRNG ckSize
        { System.out.println("ID CRNG Size " + $ckSize.value); }
        cRange
    ;

cRange
    :   pad1 = wordValue { $pad1.value == 0 }?
        rate = wordValue
        active = wordValue
        low = uByteValue
        high = uByteValue
    ;

ccrt
    :   CCRT ckSize
        { System.out.println("ID CCRT Size " + $ckSize.value); }
        cycleInfo
    ;

cycleInfo
    :   direction = wordValue
        start = uByteValue
        end = uByteValue
        seconds = longValue
        microseconds = longValue
        pad = wordValue { $pad.value == 0 }?
    ;

body
    :   BODY ckSize
        { System.out.println("ID BODY Size " + $ckSize.value); }
        { $ckSize.value > 0 }? data
        { System.out.println(" Number of bytes " + $data.value); }
        { $data.value == $ckSize.value }?
    ;

// Defines the data of the body chunk
data returns [long value]
    :   ( b += BYTE )+ { $b.size() > 0 }?
        { $value = $b.size(); }
    ;

// The ckSize is the value of a long integer
ckSize returns [long value]
    :   longValue
        { $value = $longValue.value; }
    ;

// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    :   h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
        {
            // Widen the high byte before shifting so that the value stays unsigned
            $value = ( (long) getUByteValue($h1) << 24 |
                       getUByteValue($h2) << 16 |
                       getUByteValue($l1) << 8 |
                       getUByteValue($l2) );
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
uWordValue returns [int value]
    :   b1 = BYTE b2 = BYTE
        {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    :   b1 = BYTE b2 = BYTE
        {
            $value = (short)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned byte data type
uByteValue returns [int value]
    :   BYTE
        // Calls getUByteValue on the BYTE token
        { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6. An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided with the grammar file described in the previous section, ANTLR generates the
iblmLexer and iblmParser Java programs. The Java application shown in Figure 7 uses these
programs to validate the ILBM file BLKIFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "CUserssl170PicturesIFF-ilbmBLKIFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps every byte of the file to exactly one character token
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLKIFF
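The sizes in this output can be cross-checked against one another. The cksize of a FORM counts the 4-byte form type plus, for every chunk, an 8-byte header (a 4-byte ID and a 4-byte cksize) and the chunk data. The following hedged Java sketch performs that sum; FormSizeCheck is a hypothetical name, and the pad byte added after odd-sized chunks is an assumption taken from the IFF convention rather than from the grammar in Figure 6.

```java
/** Sketch: recompute a FORM's cksize from the sizes of the chunks it contains. */
final class FormSizeCheck {

    /** Returns the expected FORM cksize for the given chunk data sizes. */
    static long formSize(long... chunkSizes) {
        long total = 4;                     // the form type ID, for example ILBM
        for (long size : chunkSizes) {
            total += 8;                     // chunk ID plus cksize field
            total += size + (size & 1);     // chunk data, padded to an even length
        }
        return total;
    }

    public static void main(String[] args) {
        // BMHD, CMAP, CAMG, and BODY sizes as reported for BLKIFF
        System.out.println(formSize(20, 48, 4, 2448));
    }
}
```

For the reported chunk sizes the sum is 4 + 28 + 56 + 12 + 2456 = 2556, which agrees with the ILBM size printed on the second line of the output.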
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and a parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;                 // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default byte order is little endian
    String byteOrder = "II";
}

// Rules used to create the parser

// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    :   ckId = fourCC { $ckId.value.equals("RIFF") }?
        ckSize = dWordValue
        formType = fourCC { $formType.value.equals("WAVE") }?
        formatSubChunk
        dataSubChunk
        { $result = true; }
    ;

fourCC returns [String value]
    :   BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    :   subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
        ckSize = dWordValue
        commonFields
        pcmSpecificField
    ;

commonFields
    :   formatTag = wordValue            // Format category
        channels = wordValue             // Number of channels
        samplesPerSec = dWordValue       // Sampling rate
        avgBytesPerSec = dWordValue      // For buffer estimation
        blockAlign = wordValue           // Data block size
    ;

pcmSpecificField
    :   bitsPerSample = wordValue        // Sample size
    ;

dataSubChunk
    :   subChunkId = fourCC { $subChunkId.value.equals("data") }?
        ckSize = dWordValue
        waveData[$ckSize.value]
    ;

// Consumes BYTE tokens while the inherited size attribute is positive
waveData [long size]
    :   ( { $size > 0 }?=> BYTE { $size--; } )+
    ;

// Defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    :   b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
        {
            // The byte shifted by 24 is widened first so that the value stays unsigned
            if (byteOrder.equals("II")) {
                $value = ( getUByteValue($b1) |
                           getUByteValue($b2) << 8 |
                           getUByteValue($b3) << 16 |
                           (long) getUByteValue($b4) << 24 );
            } else {
                $value = ( (long) getUByteValue($b1) << 24 |
                           getUByteValue($b2) << 16 |
                           getUByteValue($b3) << 8 |
                           getUByteValue($b4) );
            }
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
wordValue returns [int value]
    :   b1 = BYTE b2 = BYTE
        {
            if (byteOrder.equals("II")) {
                $value = (int)
                    ( getUByteValue($b1) |
                      getUByteValue($b2) << 8 );
            } else {
                $value = (int)
                    ( getUByteValue($b1) << 8 |
                      getUByteValue($b2) );
            }
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer
BYTE : . ;
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
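For comparison with the byte-token approach, the same field sequence can be read with a byte buffer set to little-endian order. The Java fragment below is a minimal sketch and not part of the work reported here: WaveHeaderSketch is a hypothetical class, the chunk sizes of the RIFF form and the fmt chunk are read but not checked, and the sample values (two channels, 44100 samples per second, 16 bits per sample) are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch: read the fields named in the waveform_pcm grammar with little-endian order. */
final class WaveHeaderSketch {
    final int channels;
    final long samplesPerSec;
    final int bitsPerSample;
    final long dataSize;

    private WaveHeaderSketch(int channels, long samplesPerSec, int bitsPerSample, long dataSize) {
        this.channels = channels;
        this.samplesPerSec = samplesPerSec;
        this.bitsPerSample = bitsPerSample;
        this.dataSize = dataSize;
    }

    static WaveHeaderSketch parse(ByteBuffer in) {
        in.order(ByteOrder.LITTLE_ENDIAN);             // corresponds to byteOrder "II"
        expect(in, "RIFF");
        in.getInt();                                   // ckSize of the RIFF form (not checked)
        expect(in, "WAVE");
        expect(in, "fmt ");
        in.getInt();                                   // ckSize of the fmt chunk (not checked)
        in.getShort();                                 // formatTag
        int channels = in.getShort() & 0xffff;
        long samplesPerSec = in.getInt() & 0xffffffffL;
        in.getInt();                                   // avgBytesPerSec
        in.getShort();                                 // blockAlign
        int bitsPerSample = in.getShort() & 0xffff;
        expect(in, "data");
        long dataSize = in.getInt() & 0xffffffffL;
        if (dataSize > in.remaining()) {
            throw new IllegalArgumentException("data chunk is truncated");
        }
        return new WaveHeaderSketch(channels, samplesPerSec, bitsPerSample, dataSize);
    }

    static void expect(ByteBuffer in, String fourCC) {
        byte[] b = new byte[4];
        in.get(b);
        if (!fourCC.equals(new String(b, StandardCharsets.ISO_8859_1))) {
            throw new IllegalArgumentException("expected '" + fourCC + "'");
        }
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** A 48-byte illustrative file with 4 bytes of sample data. */
    static ByteBuffer sample() {
        ByteBuffer b = ByteBuffer.allocate(48).order(ByteOrder.LITTLE_ENDIAN);
        b.put(ascii("RIFF")).putInt(40).put(ascii("WAVE"));
        b.put(ascii("fmt ")).putInt(16);
        b.putShort((short) 1).putShort((short) 2);     // formatTag, channels
        b.putInt(44100).putInt(176400);                // samplesPerSec, avgBytesPerSec
        b.putShort((short) 4).putShort((short) 16);    // blockAlign, bitsPerSample
        b.put(ascii("data")).putInt(4).put(new byte[4]);
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        WaveHeaderSketch w = parse(sample());
        System.out.println(w.channels + " ch, " + w.samplesPerSec + " Hz, " + w.bitsPerSample + " bit");
    }
}
```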
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been constructed for the
chunk-based file formats AIFF, JPEG, and WAVE, as well as the directory-based format TIFF.
However, the research reported herein differs from that of the JHOVE project in that what is
sought is a technology for generating validators for binary file formats from grammars
specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
work of the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with a pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
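A file signature test of this kind can be sketched in a few lines of Java. The fragment below is only an illustration, not the magic file used in this research: it checks the literal IDs that the grammars in Figures 6 and 9 require at byte offsets 0 and 8. The class MagicSketch and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: identify a chunk-based format from the IDs at offsets 0 and 8. */
final class MagicSketch {

    /** Returns "ILBM", "WAVE", or "unknown" for the leading bytes of a file. */
    static String identify(byte[] head) {
        if (matches(head, 0, "FORM") && matches(head, 8, "ILBM")) {
            return "ILBM";
        }
        if (matches(head, 0, "RIFF") && matches(head, 8, "WAVE")) {
            return "WAVE";
        }
        return "unknown";
    }

    /** True if the ASCII tag occurs in data at the given offset. */
    static boolean matches(byte[] data, int offset, String tag) {
        if (data.length < offset + tag.length()) {
            return false;
        }
        for (int i = 0; i < tag.length(); i++) {
            if (data[offset + i] != (byte) tag.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "RIFF\0\0\0\0WAVE".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(identify(head));
    }
}
```

Identification of this kind only selects which grammar to apply; validation of the rest of the file is still the job of the parser.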
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 pp 2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology "Appendix H LXO file format" Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
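The BODY sample count described above can be checked with a short calculation. The following is a minimal sketch; the function name `body_sample_count` is hypothetical and is not part of the 8SVX specification.

```python
def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """Total number of samples in an 8SVX BODY chunk holding ct_octave octaves."""
    highest_octave = one_shot_hi + repeat_hi  # samples in the highest-frequency octave
    # Each lower octave doubles the sample count, so the total is
    # (2^0 + 2^1 + ... + 2^(ct_octave-1)) * highest_octave.
    return sum(2 ** k for k in range(ct_octave)) * highest_octave
```

A validator could compare this value with the cksize of the BODY chunk when sCompression indicates uncompressed one-byte samples.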
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
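As an illustration of the header chunk rule, the sketch below reads the "MThd" tag, the length, and the three 16-bit header fields. It assumes the big-endian byte order used by Standard MIDI Files; the function name `parse_midi_header` is hypothetical.

```python
import struct

def parse_midi_header(data: bytes) -> dict:
    """Parse <headerchunk>: "MThd", a 32-bit length, then three 16-bit integers."""
    if len(data) < 14:
        raise ValueError("too short for a MIDI header chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length < 6:
        raise ValueError("header data must be at least 6 bytes")
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}
```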
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameck -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced in 2010 by Google, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
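The segment layout above can be walked with a few lines of code. This is a minimal sketch, not a full JPEG validator: the function name `list_jpeg_segments` is hypothetical, and the walk stops at the start-of-scan marker ($da), since entropy-coded image data rather than another length-prefixed segment follows its header.

```python
import struct

def list_jpeg_segments(data: bytes) -> list:
    """Return (type, size) pairs for the segments that follow the SOI marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos, segments = 2, []
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("segment does not start with $ff")
        seg_type = data[pos + 1]
        if seg_type == 0xD9:          # EOI: end of image
            break
        # Size is big-endian ("not Intel order") and counts its own two bytes.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((seg_type, size))
        if seg_type == 0xDA:          # SOS: compressed data follows
            break
        pos += 2 + size               # skip $ff, the type byte, and the segment body
    return segments
```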
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entries -> entry
entry -> red green blue
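The header chunk rule can be exercised against real bytes. The sketch below follows the on-disk layout of the PNG specification, in which each chunk stores a 4-byte length before the chunk type and a 4-byte CRC after the data (details the simplified grammar above does not show). The function name `parse_ihdr` is hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_FIELDS = ("width", "height", "bit_depth", "colour_type",
               "compression_method", "filter_method", "interlace_method")

def parse_ihdr(data: bytes) -> dict:
    """Check the PNG signature and decode the 13-byte IHDR chunk that follows it."""
    if len(data) < 33 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, ck_type = struct.unpack(">I4s", data[8:16])
    if ck_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    if zlib.crc32(ck_type + body) != crc:   # CRC covers chunk type and data
        raise ValueError("IHDR CRC mismatch")
    return dict(zip(IHDR_FIELDS, struct.unpack(">IIBBBBB", body)))
```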
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File (Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
List of Figures
Figure 1 File format (Layout) of a RIFF container for a WebP Image
Figure 2 ILBM File hampic Displayed
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4 A Parse Tree for the ILBM File hampic
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
Figure 6 An ANTLR Grammar for the ILBM File Format
Figure 7 A Java Application for Parsing an ILBM File
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
1 Introduction
1.1 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives. Automated tools are also needed for
viewing or playing text and binary file formats and for converting legacy and obsolete file
formats to standard or current formats. Technologies such as data description languages have
emerged to address these preservation challenges [Dunckley et al 2007].
Context-free grammars have been used to specify the syntax of programming languages.
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar and for addressing the mildly context-sensitive features of
programming languages, such as agreement of the data types of variables in expressions with
data type declarations. Such grammars have parsing and translation algorithms that can be used
for syntax checking and for interpretation or translation of the languages the grammars define.
1.2 Purpose
This research asks whether the context-free grammars used to specify the syntax of
programming languages can be extended to the specification of binary file formats, and whether
these grammars can be used with parsers to validate the formats of binary files.
1.3 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats. Section 3 describes extensions to the concepts of
context-free grammars and attribute grammars that enable the specification of binary file
formats. Section 4 presents examples of attribute array grammars for chunk-based binary file
formats. Section 5 describes recursive descent parsers for these classes of grammars.
Section 6 discusses experience in using ANTLR, a parser generator for LL(*) string grammars,
to generate parsers that recognize the formats of binary files. Section 7 describes related
research. Section 8 summarizes the results.
2 Binary File Formats
Traditionally, a file (or record) layout was a form showing how fields were positioned in a file
(or record). The fields are named and have as attributes a data type, a length, and sometimes a
constant value. Fields are offset at addresses relative to the beginning of a file. File formats are
still specified in this manner. For instance, Figure 1 shows the format layout of a RIFF container
for the VP8 codec of a WebP image [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1 File format (Layout) of a RIFF container for a WebP Image
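A layout of this kind translates directly into fixed-offset reads. The following is a minimal sketch that assumes the little-endian size fields used by RIFF and handles only the simple "VP8 " layout of Figure 1; the function name `parse_webp_header` is hypothetical.

```python
import struct

def parse_webp_header(data: bytes) -> dict:
    """Read the fields of Figure 1 at their fixed offsets."""
    if len(data) < 20:
        raise ValueError("too short for a WebP header")
    riff, riff_size, form = struct.unpack("<4sI4s", data[0:12])   # offsets 0, 4, 8
    tag, vp8_size = struct.unpack("<4sI", data[12:20])            # offsets 12, 16
    if riff != b"RIFF" or form != b"WEBP" or tag != b"VP8 ":
        raise ValueError("unexpected tag in WebP header")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 data is truncated")
    return {"riff_size": riff_size, "vp8_size": vp8_size,
            "vp8_data": data[20:20 + vp8_size]}                   # offset 20
```

Note that the reader has to use the value at offset 16 to know how many bytes to take at offset 20; a fixed layout alone cannot express that dependency.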
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human-readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not actually expressible in the EBNF (or regular
expression) notation. These details are often the context-sensitive relationships of the size of an
array to its actual length, or the relationship of an address pointer to the actual location of the
data pointed to in the file. These context-sensitive relationships are not expressible in a
context-free grammar (or EBNF notation). It is a goal of this research to extend the EBNF
notation and the concept of a context-free grammar to include the details that are necessary to
specify precisely those binary file formats that include these context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats. There
are two families of binary file formats that are readily distinguished: the chunk-based and the
directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. An IFF file itself is one entire IFF chunk. A
chunk consists of an ID tag, a size, and size bytes of data. The data may include chunks, called
subchunks. Subchunks have the same structure as chunks. Subchunks can have subchunks.
Chunk-based file formats are primarily used as containers for multimedia but have also been
used for word processing files, including pictures and other figures. File format specifications for
chunk-based formats may use terms other than chunks in describing the format, for instance
atoms in QuickTime/MP4, segments in JPEG, and "tagged data" representations, as in
AutoCAD DXF.
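The recursive chunk structure described above can be sketched as a short walker. The sketch assumes the EA IFF 85 conventions (big-endian sizes, a pad byte after odd-sized data, and FORM, LIST, and CAT group chunks whose data begins with a 4-byte type followed by subchunks); the name `walk_chunks` is hypothetical.

```python
import struct

GROUP_IDS = {b"FORM", b"LIST", b"CAT "}   # chunks whose data holds subchunks

def walk_chunks(data: bytes, start: int = 0, end: int = None, depth: int = 0) -> list:
    """Return (depth, id, size) for every chunk between start and end."""
    end = len(data) if end is None else end
    pos, found = start, []
    while pos + 8 <= end:
        ck_id, size = struct.unpack(">4sI", data[pos:pos + 8])
        body = pos + 8
        if body + size > end:
            raise ValueError("chunk data runs past its container")
        found.append((depth, ck_id, size))
        if ck_id in GROUP_IDS:
            # Skip the 4-byte group type, then descend into the subchunks.
            found.extend(walk_chunks(data, body + 4, body + size, depth + 1))
        pos = body + size + (size & 1)    # odd-sized data is followed by a pad byte
    return found
```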
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format consists of
one or more directory tables, which contain one or more directory entries. A directory entry
specifies where the actual data for a type of information is located. This is the scheme used in
TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument, and
Microsoft Open Office files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar $G$ is a quadruple $\langle N, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols
$\Sigma$ is a finite set of terminal symbols
$S \in N$ is the start symbol
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$
The set of all strings over a vocabulary $\Sigma$ is denoted by $\Sigma^{*}$. The language
generated by a context-free grammar $G$ is the set
$L(G) = \{ w \mid w \in \Sigma^{*} \text{ and } S \Rightarrow^{*} w \}$.
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i has
the address 4 × i.
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) is a one-dimensional array of data types whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, ptr. We define a binary file to be an array
of values of data of various types.
We define a context-free binary file grammar $BG$ as a quintuple
$\langle N, D, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols
$D$ is a set of data types
$\Sigma$ is a finite set of binary values of data types $D$, called terminals
$S \in N$ is the start symbol
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$
Let DataTypes indicate the union of all the values of all data types $D$. The set of all binary
files is $\mathit{BinaryFiles} = [\mathit{Indices} \rightarrow \mathit{DataTypes}]$ where
$\mathit{Indices} = \{1, \ldots, n\}$. The language generated by a context-free array (binary file)
grammar $BG$ is the set
$L(BG) = \{ w \mid w \in \mathit{BinaryFiles} \text{ and } S \Rightarrow^{*} w \}$.
By primitive nonterminal is meant a symbol which would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions which have only one or more
terminal symbols on the right. We adopt the notation shown below for primitive nonterminals:
<primitive_nonterminal_symbol type=datatype_name constant=value>
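One way to picture a primitive nonterminal is as a small record that knows how to read its data type at a given file address and, if a constant is given, to check the value read. The sketch below is illustrative only: the class `Primitive`, the type names in `STRUCT_CODES`, and the choice of big-endian encodings are assumptions, not notation from this report.

```python
import struct
from dataclasses import dataclass
from typing import Any, Optional, Tuple

# Hypothetical mapping from data type names to big-endian struct codes.
STRUCT_CODES = {"int16": ">h", "uint16": ">H", "int32": ">i", "uint32": ">I", "char4": "4s"}

@dataclass
class Primitive:
    name: str                        # primitive_nonterminal_symbol
    type: str                        # datatype_name
    constant: Optional[Any] = None   # required value, if any

    def match(self, data: bytes, pos: int) -> Tuple[Any, int]:
        """Read one value at pos; return (value, next position) or raise ValueError."""
        code = STRUCT_CODES[self.type]
        if pos + struct.calcsize(code) > len(data):
            raise ValueError(f"{self.name}: unexpected end of file")
        (value,) = struct.unpack_from(code, data, pos)
        if self.constant is not None and value != self.constant:
            raise ValueError(f"{self.name}: expected {self.constant!r}, found {value!r}")
        return value, pos + struct.calcsize(code)
```

For example, `Primitive("ckID", "char4", b"FORM")` would correspond to a primitive nonterminal written with type char4 and constant "FORM" in the notation above.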
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used, or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1969, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1969, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar $GA$ is a triple $\langle G, A, AR \rangle$ where $G$ is a context-free
grammar for the language, $A$ associates each grammar symbol $X \in (N \cup \Sigma)$ with a
set of attributes, and $AR$ associates each production $R \in P$ with a set of attribute
computation rules. $A(X)$, where $X \in (N \cup \Sigma)$, can be further partitioned into two
sets: synthesized attributes $S(X)$ and inherited attributes $I(X)$. $AR(R)$, where $R \in P$,
contains rules for computing inherited and synthesized attributes associated with the symbols in
the production $R$. Synthesized attributes are evaluated and passed up from deeper terminals
and non-terminals, while inherited attributes are passed down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes and
ii) inherited attributes, where each inherited attribute of a child node depends only on
a) attributes of the other child nodes to its left
b) inherited attributes of the parent node itself
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
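The following minimal sketch illustrates how inherited and synthesized attributes can be evaluated during a single top-down, left-to-right pass. It assumes an IFF-style FORM chunk (big-endian sizes, pad byte after odd-sized data); the function names are hypothetical and this is not the parser described later in the report. The end offset allowed by the parent is passed down as an inherited attribute, and each subchunk's total length is passed back up as a synthesized attribute that the parent checks against its own size field.

```python
import struct

def parse_chunk(data: bytes, pos: int, limit: int):
    """limit is inherited from the parent; the returned total length is synthesized."""
    if pos + 8 > limit:
        raise ValueError("chunk header runs past its parent")
    ck_id, size = struct.unpack_from(">4sI", data, pos)
    total = 8 + size + (size & 1)          # header + data + optional pad byte
    if pos + total > limit:
        raise ValueError("chunk data runs past its parent")
    return ck_id, total

def validate_form(data: bytes) -> bool:
    """True if the FORM size equals the type field plus the lengths of its subchunks."""
    if len(data) < 12:
        return False
    ck_id, size = struct.unpack_from(">4sI", data, 0)
    limit = 8 + size                       # inherited by every subchunk
    if ck_id != b"FORM" or size < 4 or limit > len(data):
        return False
    pos, consumed = 12, 4                  # 4 bytes already used by the form type
    try:
        while pos < limit:
            _, total = parse_chunk(data, pos, limit)
            pos += total
            consumed += total              # accumulate synthesized lengths
    except ValueError:
        return False
    return consumed == size
```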
33 Semantic Grammar
An interpreter for a binary file format that is using a grammar to display or play the contents of a
file or to convert it to another file format needs to produce a semantic representation that is
appropriate to the task Natural language understanding components of dialog systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task To
address this need some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a
semantic grammar produces a labeling of the input string with semantic node labels for example
SHOW     FLIGHTS  ORIGIN       DESTINATION       DEPART_DATE  DEPART_TIME
Show me  flights  from Boston  to San Francisco  on Tuesday   morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats, but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986]. In this section, the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. As described above, the attribute grammar mixes synthesized
and inherited rules. It is also an L-attributed grammar. Figure 4 shows a parse tree for the file
hampic.
Figure 2. ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
         ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]        n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
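As an illustration of what the BMHD productions of Figure 3 specify, the following Java sketch reads a BMHD chunk field by field. It is an assumption-laden example, not code from the report: the class name BmhdSketch and its sample() helper are hypothetical, java.io.DataInputStream is used because it reads big-endian values, and width and height are read as 16-bit fields, which is what the 20-byte BMHD chunk size reported in Figure 8 and the uWordValue rules of Figure 6 imply.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; the class and its helpers are hypothetical. */
class BmhdSketch {
    int width, height, xPosition, yPosition;
    int nPlanes, masking, compression, reserved;
    int transparentColor, xAspect, yAspect, pageWidth, pageHeight;

    /** BMHD -> "BMHD" cksize BitMapHeader, with each field read in grammar order. */
    static BmhdSketch read(DataInputStream in) throws IOException {
        byte[] id = new byte[4];
        in.readFully(id);
        if (!"BMHD".equals(new String(id, StandardCharsets.ISO_8859_1))) {
            throw new IOException("expected chunk id BMHD");
        }
        long ckSize = in.readInt() & 0xFFFFFFFFL;   // UINT32, big-endian
        if (ckSize != 20) {
            throw new IOException("unexpected BMHD size " + ckSize);
        }
        BmhdSketch h = new BmhdSketch();
        h.width = in.readUnsignedShort();            // UINT16
        h.height = in.readUnsignedShort();           // UINT16
        h.xPosition = in.readShort();                // INT16
        h.yPosition = in.readShort();                // INT16
        h.nPlanes = in.readUnsignedByte();           // BYTE
        h.masking = in.readUnsignedByte();
        h.compression = in.readUnsignedByte();
        h.reserved = in.readUnsignedByte();
        h.transparentColor = in.readUnsignedShort(); // UINT16
        h.xAspect = in.readUnsignedByte();
        h.yAspect = in.readUnsignedByte();
        h.pageWidth = in.readShort();                // INT16
        h.pageHeight = in.readShort();               // INT16
        return h;
    }

    /** Hand-built BMHD chunk carrying the header values listed in Figure 8. */
    static byte[] sample() {
        return new byte[] {
            'B', 'M', 'H', 'D', 0, 0, 0, 20,
            0, (byte) 200, 0, (byte) 144, 0, 0, 0, 0,
            6, 0, 1, 0, 0, 0, 3, 3,
            0, (byte) 200, 0, (byte) 144
        };
    }

    public static void main(String[] args) throws IOException {
        BmhdSketch h = read(new DataInputStream(new ByteArrayInputStream(sample())));
        System.out.println(h.width + " x " + h.height + ", planes = " + h.nPlanes);
    }
}
```

Each read call corresponds to one typed nonterminal on the right-hand side of the BitMapHeader production, so the data type in the grammar decides how many bytes are consumed.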
Figure 4. A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' <data-ck> | <silence-ck>
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
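To make the role of the data types in Figure 5 concrete, here is a small Java sketch that recognizes the start of a WAVE file through the end of the fmt chunk. It is an illustration under stated assumptions rather than the report's implementation: the class WaveFmtSketch, its helper methods, and the sample bytes are hypothetical, and the fields are assumed to be stored little-endian, as RIFF files are. The check on wFormatTag corresponds to the value=0x0001 constraint in the grammar.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; all names here are hypothetical. */
class WaveFmtSketch {
    int channels, samplesPerSec, avgBytesPerSec, blockAlign, bitsPerSample;

    /** Reads a FOURCC as a four-character string. */
    static String fourCC(ByteBuffer in) {
        byte[] id = new byte[4];
        in.get(id);
        return new String(id, StandardCharsets.ISO_8859_1);
    }

    static void expect(String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new IllegalStateException("expected '" + expected + "' but found '" + actual + "'");
        }
    }

    /** WAVE-form -> 'RIFF' cksize 'WAVE' fmt-ck ...; stops after the fmt chunk. */
    static WaveFmtSketch parse(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        expect("RIFF", fourCC(in));
        in.getInt();                              // cksize of the RIFF form (not checked here)
        expect("WAVE", fourCC(in));
        expect("fmt ", fourCC(in));
        in.getInt();                              // cksize of the fmt chunk (not checked here)
        int formatTag = in.getShort() & 0xFFFF;   // WORD
        if (formatTag != 0x0001) {
            throw new IllegalStateException("wFormatTag is not PCM: " + formatTag);
        }
        WaveFmtSketch f = new WaveFmtSketch();
        f.channels = in.getShort() & 0xFFFF;      // WORD
        f.samplesPerSec = in.getInt();            // DWORD (kept as int for brevity)
        f.avgBytesPerSec = in.getInt();           // DWORD
        f.blockAlign = in.getShort() & 0xFFFF;    // WORD
        f.bitsPerSample = in.getShort() & 0xFFFF; // WORD
        return f;
    }

    /** Hand-built header for 16-bit stereo audio at 44100 samples per second. */
    static byte[] sample() {
        ByteBuffer out = ByteBuffer.allocate(36).order(ByteOrder.LITTLE_ENDIAN);
        out.put("RIFF".getBytes(StandardCharsets.ISO_8859_1)).putInt(28);
        out.put("WAVE".getBytes(StandardCharsets.ISO_8859_1));
        out.put("fmt ".getBytes(StandardCharsets.ISO_8859_1)).putInt(16);
        out.putShort((short) 1).putShort((short) 2).putInt(44100).putInt(176400);
        out.putShort((short) 4).putShort((short) 16);
        return out.array();
    }

    public static void main(String[] args) {
        WaveFmtSketch f = parse(sample());
        System.out.println(f.channels + " channels, " + f.samplesPerSec + " Hz, "
                + f.bitsPerSample + " bits");
    }
}
```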
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check context-sensitive aspects of the grammar, such as chunk size or image dimensions, and to
use these values in parsing the chunk data or images. The pointers to file addresses that occur in
the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing the production rules that correspond to
the pointed-to location, the topmost address on the pushdown stack is popped, and the procedure
corresponding to the nonterminal at that position is executed.
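The pointer handling described above can be sketched in Java as follows. This is an illustrative example, not the report's parser: the class PointerStackSketch and the toy production record -> pointer tag (where the pointer addresses a 16-bit value elsewhere in the file) are assumptions, as is the use of java.util.ArrayDeque for the pushdown stack.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch only; the toy format and all names are hypothetical. */
class PointerStackSketch {
    private final ByteBuffer in;
    private final Deque<Integer> returnAddresses = new ArrayDeque<>();

    PointerStackSketch(byte[] file) {
        this.in = ByteBuffer.wrap(file);   // big-endian by default
    }

    /** record -> pointer:UINT32 tag:BYTE, where pointer addresses a value:UINT16. */
    int[] record() {
        int target = in.getInt();
        if (target < 0 || target > in.limit() - 2) {
            throw new IllegalStateException("pointer outside file: " + target);
        }
        returnAddresses.push(in.position());   // remember where parsing must resume
        in.position(target);                   // jump to the pointed-to location
        int value = pointedValue();            // procedure for the pointed-to nonterminal
        in.position(returnAddresses.pop());    // pop the address and resume
        int tag = in.get() & 0xFF;             // next symbol after the pointer
        return new int[] {value, tag};
    }

    /** value -> UINT16 */
    int pointedValue() {
        return in.getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] file = {0, 0, 0, 6, 7, 0, 0x12, 0x34};
        int[] r = new PointerStackSketch(file).record();
        System.out.println("value = " + r[0] + ", tag = " + r[1]);
    }
}
```

A stack rather than a single saved address is used so that a pointed-to structure may itself contain pointers.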
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require a lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
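The conversion from character tokens to binary data types can be sketched in plain Java as follows. The class ByteTokenSketch and its method names are hypothetical and are not part of ANTLR or of the report's grammars. The sketch assumes the file is decoded as ISO-8859-1, so that every byte becomes exactly one character with the same numeric value, and that multi-byte values are stored big-endian, as in IFF files. Unlike a 32-bit int computation, the uLong helper widens to long before shifting so that values with the top bit set stay non-negative.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; all names here are hypothetical. */
class ByteTokenSketch {

    /** One character token back to its unsigned byte value (0..255). */
    static int uByte(char token) {
        return token & 0xFF;
    }

    /** Two tokens combined into an unsigned 16-bit value, high byte first. */
    static int uWord(char b1, char b2) {
        return (uByte(b1) << 8) | uByte(b2);
    }

    /** Two tokens combined into a signed 16-bit value; the short cast sign-extends. */
    static int word(char b1, char b2) {
        return (short) uWord(b1, b2);
    }

    /** Four tokens combined into an unsigned 32-bit value held in a long. */
    static long uLong(char b1, char b2, char b3, char b4) {
        return ((long) uByte(b1) << 24) | (uByte(b2) << 16) | (uByte(b3) << 8) | uByte(b4);
    }

    public static void main(String[] args) {
        // Decoding as ISO-8859-1 keeps a one-to-one mapping between bytes and characters.
        String tokens = new String(new byte[] {0, 0, 0x09, (byte) 0xFC}, StandardCharsets.ISO_8859_1);
        System.out.println(uLong(tokens.charAt(0), tokens.charAt(1), tokens.charAt(2), tokens.charAt(3)));
    }
}
```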
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}

// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      // Print out ILBM chunk size
      { System.out.println("ID ILBM Size " + $ckSize.value); }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks
      body
      { $result = true; }
    ;

// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      // Print out values in the bitmap structure
      {
        System.out.println(" Width " + $width.value);
        System.out.println(" Height " + $height.value);
        System.out.println(" X Position " + $xPosition.value);
        System.out.println(" Y Position " + $yPosition.value);
        System.out.println(" Number of Panes " + $nPanes.value);
        System.out.println(" Masking " + $masking.value);
        System.out.println(" Compression " + $compression.value);
        System.out.println(" Reserved1 " + $reserved1.value);
        System.out.println(" TransParent Color " +
            $transparentColor.value);
        System.out.println(" X Aspest " + $xAspect.value);
        System.out.println(" Y Aspect " + $yAspect.value);
        System.out.println(" Page Width " + $pageWidth.value);
        System.out.println(" Page Height " + $pageHeight.value);
      }
    ;

cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
    : CMAP ckSize
      // Print out chunk size
      {
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        // Print out color values
        {
          System.out.println(" Color[" + i++ + "] " +
              $colorRegister.redValue +
              " " + $colorRegister.greenValue +
              " " + $colorRegister.blueValue +
              "");
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println(" View Modes 0x" +
            Long.toHexString($longValue.value));
      }
    ;

// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// defines the data chunk of the body chunk
data returns [long value]
    : (b+=BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;

// The cksize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        $value = (long)
            ( getUByteValue($h1) << 24 |
              getUByteValue($h2) << 16 |
              getUByteValue($l1) << 8 |
              getUByteValue($l2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      // calls getUByteValue on a BYTE token
      { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
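
// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : '\u0000'..'\u00FF' ;    // any single byte, read as one ISO-8859-1 character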
Figure 6. An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLKIFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "CUserssl170PicturesIFF-ilbmBLKIFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps each byte of the file to exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLKIFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}

// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue            // Format category
      channels = wordValue             // Number of channels
      samplesPerSec = dWordValue       // Sampling rate
      avgBytesPerSec = dWordValue      // For buffer estimation
      blockAlign = wordValue           // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue        // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

// Gated predicate: stop once size bytes have been consumed
// (size > 0 rather than size >= 0, which would admit one extra byte)
waveData [long size]
    : ( { size > 0 }?=> BYTE { size--; } )+
    ;

// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        } else {
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer
BYTE : '\u0000'..'\u00FF' ;    // any single byte, read as one ISO-8859-1 character
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in
2011, and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the
JHOVE project in that what is sought is a technology for generating validators for binary file
formats from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification
of binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with a pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A. V. Aho. Indexed Grammars - An Extension of Context-free Grammars. Journal
of the ACM, Vol. 15, Issue 4, Oct. 1968.
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D. Baum & G. Noel. The Simms Interchange File Format. 2002.
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg. Onda lossless audio compression algorithm & Onda file
formats. March 2010. httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library. JHOVE2 User's Guide. 2011.
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
[Faase] Frans Faase. BFF: A Grammar for Binary File Formats. wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher, K. & Gruber, R. (2005). PADS: A domain specific language for
processing ad hoc data. ACM Conference on Programming Language Design and
Implementation. ACM Press. httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft. Multimedia Programming Interface and Data
Specifications 1.0. August 1991. wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison SMUS IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 41 of this report
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$$
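The sample count above can be checked with a short calculation. The following is a minimal sketch in Python; the function name `body_sample_count` is hypothetical and is not part of the 8SVX specification.

```python
def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """Total number of samples in a BODY chunk holding ct_octave octaves."""
    hi_octave = one_shot_hi + repeat_hi  # samples in the highest-frequency octave
    # Octave k (k = 0 is the highest frequency) holds 2**k * hi_octave samples.
    return sum((2 ** k) * hi_octave for k in range(ct_octave))


if __name__ == "__main__":
    # Three octaves with 10 one-shot and 6 repeat samples in the highest octave.
    print(body_sample_count(3, 10, 6))
```

The geometric sum collapses to (2^ctOctave - 1) × (oneShotHiSamples + repeatHiSamples), which a validator could compare against the size field of the BODY chunk when the samples are stored uncompressed.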
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
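As an illustration of the <headerchunk> production, the following minimal Python sketch checks the "MThd" tag and the header length of 6 and then reads the three 16-bit header fields. The function name `parse_midi_header` is hypothetical. The sketch assumes, following the Standard MIDI File specification rather than the grammar itself, that the length is a 32-bit integer and that all integers are stored high byte first.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Recognize <headerchunk> at the start of data and return its fields."""
    # "MThd" tag followed by a 32-bit big-endian <length of header data>.
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("not a MIDI header chunk")
    if length != 6:
        raise ValueError(f"unexpected header length {length}")
    # <header data> -> <format> <ntrks> <division>, three 16-bit integers.
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}


if __name__ == "__main__":
    sample = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
    print(parse_midi_header(sample))
```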
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name author copyright and annotation chunks are included in the definition of every EA
IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is
optional
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. The Microsoft container formats like
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced in 2010 by Google, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 42 of this report
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
A5 JPEG
A51 JPEGJFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
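The segment layout above can be walked with a few lines of code. The following Python sketch is illustrative only: the function name `list_jpeg_segments` is hypothetical, and the decision to stop at the start-of-scan segment (type $da), after which compressed image data follows, is an assumption that goes beyond the simplified description above. Markers that carry no size field are not handled.

```python
import struct


def list_jpeg_segments(data: bytes) -> list:
    """Return (type, size) pairs for the segments that follow the SOI header."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI header")
    segments = []
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type == 0xD9:  # EOI trailer
            return segments
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # sh, sl: size is stored high byte first and includes its own two bytes.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((seg_type, size))
        if seg_type == 0xDA:  # assumption: stop at start of scan
            return segments
        pos += 2 + size  # skip $ff, the type byte, and the rest of the segment
    raise ValueError("missing EOI trailer")
```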
A52 JPEGExif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000 page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV)
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format - 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro fLC
[Crazy Talk]
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 30 - X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 Interactive Fiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SNM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
1 Introduction
11 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives Automated tools are also needed for
viewing/playing text and binary file formats and for converting legacy and obsolete file formats
to standard or current formats Technologies such as data description languages have emerged to
address these preservation challenges [Dunckley et al 2007]
Context-free grammars have been used to specify the syntax of programming languages
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar and for addressing the mildly context-sensitive features of
programming languages such as agreement of the data types of variables in expressions with data
type declarations. Such grammars have parsing/translation algorithms that can be used for
syntax-checking and interpretation or translation of the languages the grammars define
12 Purpose
The question addressed by this research is whether it is possible to extend the context-free
grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the formats of
binary files.
13 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats In section 3 extensions to the concepts of context-free
grammars and attribute grammars are described that enable the specification of binary file
formats. In section 4, examples of attribute array grammars for chunk-based binary file formats
are presented. In section 5, recursive descent parsers for these classes of grammars are described.
In section 6, experience in using ANTLR, a parser generator for LL(*) string grammars, in
generating parsers for recognizing the formats of binary files is discussed. In section 7, related
research is described Section 8 summarizes the results
2 Binary File Formats
Traditionally a file (or record) layout was a form showing how fields were positioned in a file
(or record) The fields are named and have as attributes a data type length and sometimes a
constant value Fields are offset at addresses relative to the beginning of a file File formats are
still specified in this manner For instance Figure 1 shows the format layout of a RIFF container
for the VP8 codec of a WebP image [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1. File format (layout) of a RIFF container for a WebP image
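A layout such as Figure 1 translates directly into a fixed-offset check. The following minimal Python sketch reads the fields at the offsets listed in the figure. The function name `check_webp_layout` is hypothetical, and the sketch assumes, following RIFF conventions rather than the figure, that both size fields are 32-bit little-endian integers and that the VP8 tag is padded with a trailing space.

```python
import struct


def check_webp_layout(data: bytes) -> dict:
    """Validate the fixed fields of Figure 1 and return the two size fields."""
    if len(data) < 20:
        raise ValueError("file too short for the WebP header")
    # Offsets 0, 4, 8: "RIFF" tag, size of data from offset 8, "WEBP" signature.
    riff, riff_size, form = struct.unpack("<4sI4s", data[:12])
    # Offsets 12, 16: "VP8 " tag and size of the image data from offset 20.
    vp8, vp8_size = struct.unpack("<4sI", data[12:20])
    if riff != b"RIFF" or form != b"WEBP" or vp8 != b"VP8 ":
        raise ValueError("unexpected tag")
    if 8 + riff_size > len(data) or 20 + vp8_size > 8 + riff_size:
        raise ValueError("size fields are inconsistent with the file length")
    return {"riff_size": riff_size, "vp8_size": vp8_size}
```

The last check is the kind of context-sensitive constraint, relating a stored size to the actual length of the data, that an offset table alone cannot express.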
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not actually expressible in the EBNF (or regular
expression) notation These details are often the context-sensitive relationships of the size of an
array to its actual length or the relationship of an address pointer to the actual location of the
data pointed to in the file These context-sensitive relationships are not expressible in a context-
free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and
concept of a context-free grammar to include the details that are necessary to precisely specify
binary file formats that include these context-sensitive features
Binary file formats with a similar file structure are referred to as a family of file formats There
are two families of binary file formats that are readily distinguished the chunk-based and the
directory-based file formats
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A
chunk consists of an ID tag a size and size bytes of data The data may include chunks called
subchunks Subchunks have the same structure as chunks Subchunks can have subchunks
Chunk-based file formats are primarily used as containers for multimedia but have also been
used for word processing files including pictures and other figures File format specifications for
chunk-based formats may use other terms than chunks in describing the format, for instance
atoms in QuickTime/MP4, segments in JPEG, and "tagged data representations" as in
AutoCAD DXF.
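The recursive chunk structure described above (an ID tag, a size, size bytes of data, and possibly nested subchunks) can be sketched as a small recursive reader. The Python sketch below is illustrative only: `read_chunks` and `CONTAINER_IDS` are hypothetical names, and the big-endian size field, the pad byte after odd-sized chunks, and the 4-byte type that begins the data of a container chunk are taken from the EA IFF 85 conventions rather than from the paragraph above.

```python
import struct

# Assumption: these IDs mark chunks whose data holds a 4-byte type and subchunks.
CONTAINER_IDS = {b"FORM", b"LIST", b"CAT "}


def read_chunks(data: bytes, start: int = 0, end: int = None) -> list:
    """Return a nested list of (id, size, form_type, subchunks) tuples."""
    end = len(data) if end is None else end
    chunks = []
    pos = start
    while pos + 8 <= end:
        ck_id, size = struct.unpack(">4sI", data[pos:pos + 8])
        body_start = pos + 8
        body_end = body_start + size
        if body_end > end:
            raise ValueError(f"chunk {ck_id!r} overruns its parent")
        if ck_id in CONTAINER_IDS:
            form_type = data[body_start:body_start + 4]
            # Subchunks have the same structure as chunks, so recurse.
            children = read_chunks(data, body_start + 4, body_end)
            chunks.append((ck_id, size, form_type, children))
        else:
            chunks.append((ck_id, size, None, []))
        pos = body_end + (size & 1)  # skip the pad byte after odd-sized data
    return chunks
```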
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. The Microsoft container formats like Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based A directory-based file format consists of
one or more directory tables which contain one or more directory entries A directory entry
specifies where the actual data for a type of information is located This is the scheme used in
TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and
Microsoft Open Office files
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar $G$ is a quadruple $\langle N, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols
$\Sigma$ is a finite set of terminal symbols
$S \in N$ is the start symbol
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$
The set of all strings over a vocabulary $\Sigma$ is denoted by $\Sigma^{*}$. The language
generated by a context-free grammar $G$ is the set
$L(G) = \{ w \mid w \in \Sigma^{*} \text{ and } S \Rightarrow^{*} w \}$.
31 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables) each
identified by an index An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i
has the address 4 × i.
Arrays are used to implement many other data structures such as lists and strings In most
modern computers the internal memory and external memory of storage devices (and files on
those devices) is a one-dimensional array of data types whose indices are their addresses
A data type is a classification of one of various types of data such as floating-point integer
character pointer or Boolean that determines the possible values for that type the operations
that can be performed on values of that type and the way values of that type can be stored Data
types have names such as int16 int32 float char bool ptr We define a binary file to be an array
of values of data of various types
We define a context-free binary file grammar $BG$ as a quintuple
$\langle N, D, \Sigma, S, P \rangle$ where
$N$ is a finite set of non-terminal symbols
$D$ is a set of data types
$\Sigma$ is a finite set of binary values of data types $D$, called terminals
$S \in N$ is the start symbol
$P$ is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$
Let DataTypes indicate the union of all the values of all data types $D$. The set of all binary
files is BinaryFiles = [Indices → DataTypes], where Indices $= \{1, 2, \ldots\}$. The language
generated by a context-free array (binary file) grammar $BG$ is the set
$L(BG) = \{ w \mid w \in \text{BinaryFiles and } S \Rightarrow^{*} w \}$.
By primitive nonterminal is meant a symbol which would be replaced by one or more terminal
symbols were an additional production rule applied In other words a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions which only have one or more
terminal symbols on the right We adopt the notation shown below for primitive nonterminals
<primitive_nonterminal_symbol type=datatype_name constant=value>
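One way to read this notation is as a small record that a parser matches against the bytes at the current file position. The following Python sketch is an illustration under stated assumptions, not the report's implementation: the class `Primitive`, the table `FORMATS`, and the choice of big-endian encodings for the data type names are all hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import Any, Optional, Tuple

# Hypothetical mapping from data type names to struct codes (big-endian assumed).
FORMATS = {"int16": ">h", "int32": ">i", "float": ">f", "char": ">c"}


@dataclass
class Primitive:
    """A primitive nonterminal: a name, a data type, and an optional constant."""
    name: str
    type: str
    constant: Optional[Any] = None

    def match(self, data: bytes, pos: int) -> Tuple[Any, int]:
        """Read one value of this type at pos; return (value, next position)."""
        fmt = FORMATS[self.type]
        size = struct.calcsize(fmt)
        if pos + size > len(data):
            raise ValueError(f"{self.name}: not enough data at offset {pos}")
        (value,) = struct.unpack_from(fmt, data, pos)
        if self.constant is not None and value != self.constant:
            raise ValueError(f"{self.name}: expected {self.constant!r}, got {value!r}")
        return value, pos + size
```

A primitive with a constant behaves like a terminal that must appear in the file, while one without a constant accepts any value of its data type.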
32 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages
such as (1) enforcing the constraint that all variables are declared before they are used or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1969, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1969, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar $GA$ is a triple $\langle G, A, AR \rangle$, where $G$ is a context-free
grammar for the language, $A$ associates each grammar symbol $X \in (N \cup \Sigma)$ with a set
of attributes, and $AR$ associates each production $R \in P$ with a set of attribute computation
rules. $A(X)$, where $X \in (N \cup \Sigma)$, can be further partitioned into two sets: synthesized
attributes $S(X)$ and inherited attributes $I(X)$. $AR(R)$, where $R \in P$, contains rules for
computing inherited and synthesized attributes associated with the symbols in the production $R$.
Synthesized attributes are evaluated and passed up from deeper terminals and non-terminals, while
inherited attributes are passed down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes and
ii) inherited attributes where the inherited attribute depends only on
a) attributes of the other child nodes to its left
b) inherited attributes of the parent node (the left-hand side of the production) itself
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree As a result attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing
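A single production such as nameck → "NAME" size CHAR[size] illustrates how this works in a recursive descent parser: the value of size is synthesized from the file and then passed to its right-hand sibling as an inherited attribute. The Python sketch below is a minimal illustration; the function names are hypothetical, and a 32-bit big-endian size and ASCII text are assumed.

```python
import struct


def parse_size(data: bytes, pos: int):
    """size -> ULONG. Synthesized attribute: the integer value read."""
    (value,) = struct.unpack_from(">I", data, pos)
    return value, pos + 4


def parse_text(data: bytes, pos: int, length: int):
    """text -> CHAR[length]. `length` is inherited from the sibling to the left."""
    if pos + length > len(data):
        raise ValueError("text runs past the end of the data")
    return data[pos:pos + length].decode("ascii"), pos + length


def parse_name_chunk(data: bytes, pos: int = 0):
    """nameck -> "NAME" size CHAR[size], evaluated in one left-to-right pass."""
    if data[pos:pos + 4] != b"NAME":
        raise ValueError("expected NAME tag")
    size, pos = parse_size(data, pos + 4)   # synthesized, passed up
    text, pos = parse_text(data, pos, size)  # inherited, passed down and right
    return {"size": size, "text": text}, pos
```

Because every inherited value comes from a symbol already parsed to the left, no second pass over the parse tree is needed.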
33 Semantic Grammar
An interpreter for a binary file format that is using a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialog systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993; Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
SHOW      FLIGHTS   ORIGIN        DESTINATION        DEPART_DATE   DEPART_TIME
Show me   flights   from Boston   to San Francisco   on Tuesday    morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats, but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format spec are represented as nonterminals in the grammar. Constant
values in fields are terminals that appear on the right-hand side of productions. Nonterminal
field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986]. In this section, the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a mixture of synthesized semantic rules
(attribute values of the left side of a production are calculated from attribute values of symbols
on the right side) and inherited rules (attribute values of symbols on the right side of a
production are calculated from attribute values of the symbol on the left side). The attribute
grammar is also an L-attributed grammar. Figure 4 shows a parse tree for the file hampic.
Figure 2 ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
    ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]
    n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize.value]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
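To make the typed fields of the BitMapHeader production concrete, here is a minimal Java sketch that reads them in the order listed in the grammar. It is an illustration only: the class name is hypothetical, and it assumes big-endian byte order and 16-bit width and height fields, so that the header occupies 20 bytes.

```java
import java.nio.ByteBuffer;

final class BitMapHeaderSketch {
    final int width, height;                  // UINT16
    final int xPosition, yPosition;           // INT16
    final int nPlanes, masking, compression;  // BYTE
    final int transparentColor;               // UINT16
    final int xAspect, yAspect;               // BYTE
    final int pageWidth, pageHeight;          // INT16

    /** Reads the 20-byte header starting at the buffer's current position. */
    BitMapHeaderSketch(ByteBuffer in) {
        width = in.getShort() & 0xFFFF;       // mask to treat the value as unsigned
        height = in.getShort() & 0xFFFF;
        xPosition = in.getShort();            // signed fields keep their sign
        yPosition = in.getShort();
        nPlanes = in.get() & 0xFF;
        masking = in.get() & 0xFF;
        compression = in.get() & 0xFF;
        in.get();                             // reserved byte, value ignored
        transparentColor = in.getShort() & 0xFFFF;
        xAspect = in.get() & 0xFF;
        yAspect = in.get() & 0xFF;
        pageWidth = in.getShort();
        pageHeight = in.getShort();
    }
}
```

Each read corresponds to one typed nonterminal on the right-hand side of the production, which is why no separate tokenizer is involved.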
Figure 4 A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> → "RIFF" cksize "WAVE"
              <fmt-ck>
              [<fact-ck>]
              [<cue-ck>]
              [<playlist-ck>]
              [<assoc-data-list>]
              <wave-data>
<fmt-ck> → "fmt " cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> → <wFormatTag:WORD value=0x0001>
                  <wChannels:WORD>
                  <dwSamplesPerSec:DWORD>
                  <dwAvgBytesPerSec:DWORD>
                  <wBlockAlign:WORD>
<wave-data> → <data-ck> | <wave-list>
<data-ck> → "data" cksize <wave-data>
<wave-list> → "LIST" cksize "wavl" <data-ck> | <silence-ck>
<silence-ck> → "slnt" cksize <dwSamples:DWORD>
<fact-ck> → "fact" cksize <dwFileSize:DWORD>
<cue-ck> → "cue " cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> → <dwName:DWORD>
              <dwPosition:DWORD>
              <fccChunk:FOURCC>
              <dwChunkStart:DWORD>
              <dwBlockStart:DWORD>
              <dwSampleOffset:DWORD>
<playlist-ck> → "plst" cksize <dwSegments:DWORD> <play-segment>+
<play-segment> → <dwName:DWORD>
                 <dwLength:DWORD>
                 <dwLoops:DWORD>
<assoc-data-list> → "LIST" cksize "adtl" <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> → "labl" cksize <dwName:DWORD> <data:ZSTR>
<note-ck> → "note" cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> → "ltxt" cksize <dwName:DWORD>
            <dwSampleLength:DWORD>
            <dwPurpose:DWORD>
            <wCountry:WORD>
            <wLanguage:WORD>
            <wDialect:WORD>
            <wCodePage:WORD>
            <data:BYTE>+
<file-ck> → "file" cksize <dwName:DWORD> <dwMedType:DWORD>
            <fileData:BYTE>+
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
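The following Java sketch suggests what such a lexer-free predictive recognizer might look like for a simplified ILBM structure. It is not the authors' parser. The class name and the particular checks (a BMHD chunk of exactly 20 bytes, a CMAP size divisible by 3, a FORM size equal to the rest of the file, odd-sized chunks padded to an even length) are assumptions chosen for illustration.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class IlbmRecognizerSketch {

    /** Returns true if the bytes follow the simplified ILBM structure. */
    static boolean recognize(byte[] file) {
        try {
            ByteBuffer in = ByteBuffer.wrap(file);    // big-endian by default
            if (!fourCC(in).equals("FORM")) return false;
            long formSize = in.getInt() & 0xFFFFFFFFL;
            if (formSize != in.remaining()) return false;
            if (!fourCC(in).equals("ILBM")) return false;

            boolean foundBmhd = false;
            while (in.hasRemaining()) {
                String id = fourCC(in);               // one FOURCC selects the production
                long ckSize = in.getInt() & 0xFFFFFFFFL;
                if (ckSize > in.remaining()) return false;
                switch (id) {
                    case "BMHD":
                        if (ckSize != 20) return false;
                        foundBmhd = true;
                        break;
                    case "CMAP":
                        if (ckSize % 3 != 0) return false;
                        break;
                    case "BODY":
                        if (!foundBmhd) return false; // header must precede the body
                        break;
                    default:
                        break;                        // other chunks are skipped
                }
                long padded = ckSize + (ckSize & 1);  // assumed even-length padding
                in.position(in.position() + (int) Math.min(padded, in.remaining()));
            }
            return foundBmhd;
        } catch (BufferUnderflowException e) {
            return false;                             // file ended inside a field
        }
    }

    /** Reads the next four bytes as a chunk identifier. */
    private static String fourCC(ByteBuffer in) {
        byte[] id = new byte[4];
        in.get(id);
        return new String(id, StandardCharsets.ISO_8859_1);
    }
}
```

The parser never backtracks: the four-byte identifier it has just read determines which production applies, and the data types of the following fields are fixed by that production.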
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur in
the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped, and the procedure
corresponding to the nonterminal at that position is executed.
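The address-stack idea can be sketched in a few lines of Java. The layout parsed here is a toy format invented for the example (a 4-byte big-endian offset to a record, followed by a 1-byte tag, with a 2-byte record at the given offset). It is not one of the formats discussed in this paper, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

final class PointerStackSketch {

    /**
     * Parses the toy layout and returns {recordValue, tag}.
     * The pointer is followed immediately; the position after the pointer
     * field is kept on a stack so that parsing can resume there afterwards.
     */
    static int[] parse(ByteBuffer in) {
        Deque<Integer> returnAddresses = new ArrayDeque<>();

        int recordOffset = in.getInt();            // pointer field
        returnAddresses.push(in.position());       // remember where to resume
        in.position(recordOffset);                 // jump to the pointed-to location
        int recordValue = in.getShort() & 0xFFFF;  // parse the record found there

        in.position(returnAddresses.pop());        // resume after the pointer field
        int tag = in.get() & 0xFF;
        return new int[] { recordValue, tag };
    }
}
```

With nested pointers the stack simply grows by one address per pointer followed, which is why a pushdown stack rather than a single saved position is needed.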
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-based
and directory-based binary file formats. This is accomplished by writing a lexer that treats
each input byte in a file as a character token. Then functions are created for each data type in the
binary file grammar that convert the appropriate number of character tokens to binary data types,
e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
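The byte-as-character workaround can be seen in isolation in the following Java sketch. It assumes the file is decoded as ISO-8859-1, which maps each byte value 0x00 to 0xFF to the character with the same code, so the original byte can be recovered from each one-character token. The class and method names are hypothetical, and big-endian order is assumed.

```java
import java.nio.charset.StandardCharsets;

final class ByteTokenSketch {

    /** Recovers the unsigned byte value (0..255) carried by a one-character token. */
    static int uByte(String text, int index) {
        return text.charAt(index) & 0xFF;
    }

    /** Combines two byte tokens into an unsigned 16-bit big-endian value. */
    static int uWord(String text, int index) {
        return (uByte(text, index) << 8) | uByte(text, index + 1);
    }

    /** Combines two byte tokens into a signed 16-bit big-endian value. */
    static int word(String text, int index) {
        return (short) uWord(text, index);    // the cast to short restores the sign
    }

    /** Combines four byte tokens into an unsigned 32-bit big-endian value. */
    static long uLong(String text, int index) {
        return ((long) uWord(text, index) << 16) | uWord(text, index + 2);
    }

    public static void main(String[] args) {
        byte[] raw = { 0x00, (byte) 0xC8, (byte) 0xFF, (byte) 0xFE };
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(uWord(text, 0));   // 200
        System.out.println(word(text, 2));    // -2
        System.out.println(uLong(text, 0));   // 13172734
    }
}
```

Widening to long before the 16-bit shift in uLong keeps values at or above 0x80000000 from turning negative, which a plain int computation would not.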
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}
// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      // Print out ILBM chunk size
      { System.out.println("ID ILBM Size " + $ckSize.value); }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks
      body
      { $result = true; }
    ;

// The bmhd, cmap, and camg property chunks can occur in any order.
// At least one BMHD chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg
      )+
    ;
bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      // Print out values in the bitmap structure
      {
        System.out.println("Width " + $width.value);
        System.out.println("Height " + $height.value);
        System.out.println("X Position " + $xPosition.value);
        System.out.println("Y Position " + $yPosition.value);
        System.out.println("Number of Panes " + $nPanes.value);
        System.out.println("Masking" + $masking.value);
        System.out.println("Compression" + $compression.value);
        System.out.println("Reserved1" + $reserved1.value);
        System.out.println("TransParent Color" +
                           $transparentColor.value);
        System.out.println("X Aspest" + $xAspect.value);
        System.out.println("Y Aspect" + $yAspect.value);
        System.out.println("Page Width" + $pageWidth.value);
        System.out.println("Page Height" + $pageHeight.value);
      }
    ;
// The index value i is initialized in the @init section of the cmap rule
cmap returns [long value]
@init { int i = 0; }
    : CMAP ckSize
      // Print out chunk size
      {
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        // Print out color values
        {
          System.out.println("Color[" + i++ + "] " +
                             $colorRegister.redValue +
                             " " + $colorRegister.greenValue +
                             " " + $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println("View Modes 0x" +
                           Long.toHexString($longValue.value));
      }
    ;
// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println("Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// Defines the data of the body chunk
data returns [long value]
    : (b += BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;

// The ckSize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// Defines the longValue data type in terms of BYTE terminals.
// The high byte is widened to long before shifting so that values of
// 0x80000000 and above do not become negative.
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        $value = ((long) getUByteValue($h1) << 24) |
                 (getUByteValue($h2) << 16) |
                 (getUByteValue($l1) << 8) |
                 getUByteValue($l2);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      // Calls getUByteValue on the BYTE token
      { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the grammar described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLKIFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "CUserssl170PicturesIFF-ilbmBLKIFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Each byte is decoded as one ISO-8859-1 character, so byte values are preserved
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default byte order is little-endian
    String byteOrder = "II";
}

// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue          // Format category
      channels = wordValue           // Number of channels
      samplesPerSec = dWordValue     // Sampling rate
      avgBytesPerSec = dWordValue    // For buffer estimation
      blockAlign = wordValue         // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue      // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ( { size >= 0 }?=> BYTE { size--; } )+
    ;

// Defines the long data type in terms of BYTE terminals.
// The most significant byte is widened to long before shifting so that
// values of 0x80000000 and above do not become negative.
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = getUByteValue($b1) |
                     (getUByteValue($b2) << 8) |
                     (getUByteValue($b3) << 16) |
                     ((long) getUByteValue($b4) << 24);
        } else {
            $value = ((long) getUByteValue($b1) << 24) |
                     (getUByteValue($b2) << 16) |
                     (getUByteValue($b3) << 8) |
                     getUByteValue($b4);
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the word data type (16 bits, returned as a non-negative int)
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

Terminal rule used to create the Lexer
BYTE
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
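The only structural difference between the word and double-word rules of the ILBM and WAVE grammars is byte order: IFF files store multi-byte integers most significant byte first, while RIFF files store them least significant byte first. The following Java sketch shows the same distinction using the standard java.nio classes instead of hand-written shifts. The class and method names are hypothetical and are not part of the grammars above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {

    /** Reads an unsigned 32-bit value from four bytes in the given byte order. */
    static long dWord(byte[] fourBytes, ByteOrder order) {
        // Masking with a long constant keeps values above 0x7FFFFFFF positive.
        return ByteBuffer.wrap(fourBytes).order(order).getInt() & 0xFFFFFFFFL;
    }

    /** Reads an unsigned 16-bit value from two bytes in the given byte order. */
    static int word(byte[] twoBytes, ByteOrder order) {
        return ByteBuffer.wrap(twoBytes).order(order).getShort() & 0xFFFF;
    }
}
```

For example, the bytes 0x90 0x09 0x00 0x00 read as a little-endian double word give 2448, and the bytes 0x00 0x00 0x09 0x90 read as big-endian give the same value.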
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
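For the two formats treated in this paper, a signature test amounts to comparing a few bytes at fixed offsets. The Java sketch below is a hypothetical stand-in for such a test, not an excerpt from the magic file mentioned above. It assumes the container identifier occupies bytes 0 to 3 and the form type occupies bytes 8 to 11, as in the grammars of Section 4.

```java
import java.nio.charset.StandardCharsets;

final class ChunkSignatureSketch {

    /** Returns "ILBM", "WAVE", or "unknown" based on bytes 0-3 and 8-11. */
    static String identify(byte[] header) {
        if (header.length < 12) {
            return "unknown";                 // too short to hold both identifiers
        }
        String container = new String(header, 0, 4, StandardCharsets.ISO_8859_1);
        String formType = new String(header, 8, 4, StandardCharsets.ISO_8859_1);
        if (container.equals("FORM") && formType.equals("ILBM")) {
            return "ILBM";
        }
        if (container.equals("RIFF") && formType.equals("WAVE")) {
            return "WAVE";
        }
        return "unknown";
    }
}
```

Identification of this kind only selects which grammar to apply. Whether the file actually conforms to the format is decided by the recognizer generated from that grammar.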
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewer/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR: A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
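The sample-count relationship for the BODY chunk can be checked numerically. The following is a minimal Python sketch, not part of the 8SVX specification; the function names `octave_sizes` and `body_sample_count` are hypothetical and chosen only for illustration.

```python
def octave_sizes(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> list:
    """Sample counts per octave, highest-frequency octave first."""
    highest = one_shot_hi + repeat_hi  # samples in the highest octave
    # Each successive (lower) octave holds twice as many samples.
    return [(2 ** k) * highest for k in range(ct_octave)]


def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """Total samples in the BODY: (2^0 + ... + 2^(ctOctave-1)) * highest."""
    return sum(octave_sizes(ct_octave, one_shot_hi, repeat_hi))
```

A validator could compare this total with the cksize of the BODY chunk, which is exactly the kind of context-sensitive check that a purely context-free grammar cannot express.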
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks> | <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
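As an illustration of the header and track-chunk rules, the following Python sketch reads the MThd chunk and then steps over MTrk chunks. It assumes big-endian 32-bit chunk lengths, as in the Standard MIDI File specification; the function names are hypothetical and the events inside each track are not decoded.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Parse <headerchunk>: "MThd", a length of 6, then three 16-bit integers."""
    if len(data) < 14:
        raise ValueError("too short to hold an MThd chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:  # the text above states the header data length is 6
        raise ValueError(f"unexpected header length {length}")
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}


def iter_track_chunks(data: bytes, offset: int = 14):
    """Yield the raw <track data> bytes of each MTrk chunk in turn."""
    while offset + 8 <= len(data):
        tag, length = struct.unpack(">4sI", data[offset:offset + 8])
        if tag != b"MTrk":
            raise ValueError(f"expected MTrk at offset {offset}")
        yield data[offset + 8:offset + 8 + length]
        offset += 8 + length
```

A syntax checker could additionally compare the number of chunks yielded with the ntrks field, a constraint that the pseudo-EBNF rules leave to the prose.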
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. Apple [1991] also specified a compressed version, AIFC, of the Audio
Interchange File Format. The following recasts the AIFF specification as a context-free
pseudo-EBNF grammar.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote lowVelocity
                   highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[]
CommentsChunk -> "COMT" cksize numcomments comments[]
Comments -> Comment Comments
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
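The offset table can be turned into a small header check. The sketch below is illustrative only. It assumes that the two size fields are 32-bit little-endian integers (the RIFF convention) and that the tag at offset 12 is "VP8" followed by a space. The function name `parse_webp_header` is hypothetical.

```python
import struct


def parse_webp_header(data: bytes) -> dict:
    """Read the fixed 20-byte layout of a simple RIFF/WebP container."""
    if len(data) < 20:
        raise ValueError("too short for a RIFF/WebP header")
    riff_tag, riff_size, form_type = struct.unpack("<4sI4s", data[:12])
    vp8_tag, vp8_size = struct.unpack("<4sI", data[12:20])
    if riff_tag != b"RIFF" or form_type != b"WEBP":
        raise ValueError("not a RIFF container with form type WEBP")
    if vp8_tag != b"VP8 ":
        raise ValueError("expected a VP8 chunk at offset 12")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 chunk size runs past the end of the file")
    return {
        "riff_size": riff_size,  # size of the data starting at offset 8
        "vp8_size": vp8_size,    # size of the data starting at offset 20
        "image_data": data[20:20 + vp8_size],
    }
```

The last check compares a declared size against the actual length of the file, which is the size-to-length relationship that the prose describes as context-sensitive.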
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
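The segment layout above can be walked with a short loop. The following Python sketch is a simplified illustration rather than a complete JPEG reader. It stops at the start-of-scan segment (type $da), on the assumption that the entropy-coded data after it is not organized as sized segments. The function name `iter_jpeg_segments` is hypothetical.

```python
import struct


def iter_jpeg_segments(data: bytes):
    """Yield (segment_type, payload) pairs that follow the SOI header."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI header ($ff $d8)")
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type == 0xD9:  # EOI trailer
            return
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # Size is big-endian and counts its own two bytes, not $ff or the type.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2:
            raise ValueError("segment size smaller than its own length field")
        yield seg_type, data[pos + 4:pos + 2 + size]
        if seg_type == 0xDA:  # start of scan: stop the simple walk here
            return
        pos += 2 + size
```

Because the size field includes its own two bytes, the payload of each segment is size - 2 bytes long, which the slice bounds reflect.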
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
Figure 1. File format (layout) of a RIFF container for a WebP image
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human-readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not actually expressible in the EBNF (or regular
expression) notation. These details are often the context-sensitive relationships of the size of an
array to its actual length, or the relationship of an address pointer to the actual location of the
data pointed to in the file. These context-sensitive relationships are not expressible in a
context-free grammar (or EBNF notation). It is a goal of this research to extend the EBNF notation
and the concept of a context-free grammar to include the details that are necessary to precisely
specify binary file formats that include these context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats. There
are two families of binary file formats that are readily distinguished: the chunk-based and the
directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. An IFF file itself is one entire IFF chunk. A
chunk consists of an ID tag, a size, and size bytes of data. The data may include chunks, called
subchunks. Subchunks have the same structure as chunks. Subchunks can have subchunks.
Chunk-based file formats are primarily used as containers for multimedia but have also been
used for word processing files, including pictures and other figures. File format specifications for
chunk-based formats may use terms other than chunks in describing the format, for instance
"atoms" in QuickTime/MP4, "segments" in JPEG, and "tagged data representations" in
AutoCAD DXF.
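The recursive chunk structure described above (ID tag, size, data, possibly containing subchunks) can be sketched as a small recursive reader. This Python example is illustrative only. It assumes the EA IFF 85 conventions of 4-byte IDs, 32-bit big-endian sizes, a pad byte after odd-sized chunks, and the group IDs FORM, LIST and CAT holding a 4-byte type followed by subchunks. The names `parse_chunks` and `GROUP_IDS` are hypothetical.

```python
import struct

# Assumed EA IFF 85 group chunk IDs whose data is a type ID plus subchunks.
GROUP_IDS = {b"FORM", b"LIST", b"CAT "}


def parse_chunks(data: bytes, start: int = 0, end: int = None) -> list:
    """Return a list of chunk dicts found in data[start:end]."""
    end = len(data) if end is None else end
    chunks = []
    pos = start
    while pos + 8 <= end:
        ck_id, size = struct.unpack(">4sI", data[pos:pos + 8])
        body_start = pos + 8
        body_end = body_start + size
        if body_end > end:
            raise ValueError(f"chunk {ck_id!r} runs past its container")
        node = {"id": ck_id, "size": size}
        if ck_id in GROUP_IDS:
            node["type"] = data[body_start:body_start + 4]
            node["subchunks"] = parse_chunks(data, body_start + 4, body_end)
        else:
            node["data"] = data[body_start:body_end]
        chunks.append(node)
        pos = body_end + (size & 1)  # skip the pad byte after odd sizes
    return chunks
```

The check that a chunk does not run past its container is a constraint between a size field and the data that follows it, which a context-free grammar alone cannot enforce.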
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on
Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format consists of
one or more directory tables, which contain one or more directory entries. A directory entry
specifies where the actual data for a type of information is located. This is the scheme used in
TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS OpenDocument, and
Microsoft Open Office files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar G is a quadruple $\langle N, \Sigma, S, P \rangle$ where
N is a finite set of non-terminal symbols
Σ is a finite set of terminal symbols
$S \in N$ is the start symbol
P is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$
The set of all strings over a vocabulary Σ is denoted by $\Sigma^{*}$. The language generated by a
context-free grammar G is the set $L(G) = \{\, w \mid w \in \Sigma^{*} \text{ and } S \Rightarrow^{*} w \,\}$.
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i has the
address 4 × i.
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) are one-dimensional arrays of data types whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, ptr. We define a binary file to be an array
of values of data of various types.
We define a context-free binary file grammar BG as a quintuple $\langle N, D, \Sigma, S, P \rangle$ where
N is a finite set of non-terminal symbols
D is a set of data types
Σ is a finite set of binary values of data types D, called terminals
$S \in N$ is the start symbol
P is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^{*}$
Let DataTypes indicate the union of all the values of all data types D. The set of all binary files is
$\mathit{BinaryFiles} = [\mathit{Indices} \rightarrow \mathit{DataTypes}]$ where $\mathit{Indices} = \{1, 2, \ldots\}$. The language generated by a
context-free array (binary file) grammar BG is the set
$L(BG) = \{\, w \mid w \in \mathit{BinaryFiles} \text{ and } S \Rightarrow^{*} w \,\}$.
By primitive nonterminal is meant a symbol which would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions which have only terminal
symbols on the right. We adopt the notation shown below for primitive nonterminals:
<primitive_nonterminal_symbol type=datatype_name constant=value>
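One way to read a primitive nonterminal is as an instruction to decode a value of the named data type at the current file position and, when a constant is given, to check the decoded value against it. The Python sketch below illustrates this reading. The type-name table `STRUCT_CODES`, the function `read_primitive`, and the default big-endian byte order are all assumptions made for the example, not part of the notation.

```python
import struct

# Hypothetical mapping from data type names to struct format codes.
STRUCT_CODES = {
    "UINT8": "B",
    "INT16": "h",
    "UINT16": "H",
    "INT32": "i",
    "UINT32": "I",
}


def read_primitive(data, offset, type_name, constant=None, byte_order=">"):
    """Decode one value of type_name at offset; return (value, next_offset)."""
    fmt = byte_order + STRUCT_CODES[type_name]
    size = struct.calcsize(fmt)
    if offset + size > len(data):
        raise ValueError(f"not enough bytes for {type_name} at offset {offset}")
    (value,) = struct.unpack_from(fmt, data, offset)
    # A constant attribute restricts the primitive to a single terminal value.
    if constant is not None and value != constant:
        raise ValueError(f"expected constant {constant}, found {value}")
    return value, offset + size
```

For example, a field written as <wFormatTag type=UINT16 constant=1> would correspond to the call `read_primitive(data, offset, "UINT16", constant=1)`, with the byte order chosen to suit the format.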
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1969, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1969, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar GA is a triple $\langle G, A, AR \rangle$ where G is a context-free grammar for the
language, A associates each grammar symbol $X \in (N \cup \Sigma)$ with a set of attributes, and AR
associates each production $R \in P$ with a set of attribute computation rules. A(X), where
$X \in (N \cup \Sigma)$, can be further partitioned into two sets: synthesized attributes S(X) and inherited
attributes I(X). AR(R), where $R \in P$, contains rules for computing inherited and synthesized attributes
associated with the symbols in the production R. Synthesized attributes are evaluated and passed
up from deeper terminals and non-terminals, while inherited attributes are passed down from
upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes and
ii) inherited attributes, where the inherited attribute depends only on
a) attributes of the other child nodes to its left
b) inherited attributes of the parent node P itself
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
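To make the left-to-right evaluation concrete, the following Python sketch parses a single chunk of the form tag, cksize, samples[size] by recursive descent. The cksize primitive synthesizes a value, and that value is handed to its right sibling as an inherited attribute that fixes the array length. The chunk tag, the big-endian size, and all function names are assumptions chosen for illustration.

```python
import struct


def parse_cksize(data: bytes, pos: int):
    """Primitive nonterminal: synthesizes the attribute `value`."""
    if pos + 4 > len(data):
        raise ValueError("truncated cksize")
    (value,) = struct.unpack_from(">I", data, pos)
    return value, pos + 4


def parse_samples(data: bytes, pos: int, size: int):
    """Array nonterminal: `size` is inherited from the sibling to its left."""
    if pos + size > len(data):
        raise ValueError("declared size exceeds the remaining data")
    return data[pos:pos + size], pos + size


def parse_body_chunk(data: bytes, pos: int = 0):
    """bodyck -> "BODY" cksize samples[size], evaluated left to right."""
    if data[pos:pos + 4] != b"BODY":
        raise ValueError("expected BODY tag")
    size, pos = parse_cksize(data, pos + 4)        # left sibling first
    samples, pos = parse_samples(data, pos, size)  # inherits size
    return {"cksize": size, "samples": samples}, pos
```

Because each inherited attribute depends only on symbols already parsed, a single top-down pass over the file suffices and no second traversal of the tree is needed.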
3.3 Semantic Grammar
An interpreter for a binary file format that uses a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialogue systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993, Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
SHOW     FLIGHTS  ORIGIN       DESTINATION       DEPART_DATE  DEPART_TIME
Show me  flights  from Boston  to San Francisco  on Tuesday   morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions. Nonterminal
field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986a]. In this section the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a mixture of synthesized semantic rules
(attribute values of the left side of a production are calculated from attribute values of symbols
on the right side) and inherited rules (attribute values of symbols on the right side of a
production are calculated from attribute values of the symbol on the left side). The attribute
grammar is also an L-attributed grammar. Figure 4 shows a parse tree for the file hampic.
Figure 2 ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
    ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
    <height type=UINT16>
    <xposition type=INT16>
    <yposition type=INT16>
    <nplanes type=BYTE>
    <masking type=BYTE>
    <compression type=BYTE>
    <reserved type=BYTE>
    <transparentcolor type=UINT16>
    <xaspect type=BYTE>
    <yaspect type=BYTE>
    <pagewidth type=INT16>
    <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]
    n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4. A Parse Tree for the ILBM File hampic
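The way a semantic rule steers parsing can be illustrated with the CMAP production of Figure 3, where the number of color entries n is computed from the chunk size. The following is a minimal hand-written sketch in Java, not the authors' implementation; the class name CmapSketch and its method names are hypothetical, and it assumes that the chunk is already in memory and that, as in IFF files, integers are big-endian.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical hand-written check of the <CMAP> production in Figure 3. */
final class CmapSketch {

    /** Reads a four-character chunk ID and fails if it is not the expected terminal. */
    static void expectId(ByteBuffer in, String expected) {
        byte[] id = new byte[4];
        in.get(id);
        String found = new String(id, StandardCharsets.US_ASCII);
        if (!found.equals(expected)) {
            throw new IllegalArgumentException("expected " + expected + " but found " + found);
        }
    }

    /**
     * <CMAP> -> "CMAP" <cksize type=UINT32> <color>[n], with n = cksize.value / 3.
     * Returns the n color entries as {red, green, blue} triples.
     */
    static int[][] parseCmap(ByteBuffer in) {
        expectId(in, "CMAP");
        // UINT32 chunk size; a ByteBuffer is big-endian unless told otherwise.
        long ckSize = in.getInt() & 0xFFFFFFFFL;
        if (ckSize % 3 != 0 || ckSize > in.remaining()) {
            throw new IllegalArgumentException("bad CMAP size " + ckSize);
        }
        int n = (int) (ckSize / 3);          // semantic rule: n = cksize.value / 3
        int[][] colors = new int[n][3];
        for (int i = 0; i < n; i++) {        // <color>[n]
            colors[i][0] = in.get() & 0xFF;  // <red type=BYTE>
            colors[i][1] = in.get() & 0xFF;  // <green type=BYTE>
            colors[i][2] = in.get() & 0xFF;  // <blue type=BYTE>
        }
        return colors;
    }
}
```

The value read for cksize is used immediately to bound the repetition that follows it, which is possible because the attribute flows from left to right within the production.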
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive nonterminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
        <fmt-ck>
        [<fact-ck>]
        [<cue-ck>]
        [<playlist-ck>]
        [<assoc-data-list>]
        <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
        <wChannels:WORD>
        <dwSamplesPerSec:DWORD>
        <dwAvgBytesPerSec:DWORD>
        <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' { <data-ck> | <silence-ck> }...
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
        <dwPosition:DWORD>
        <fccChunk:FOURCC>
        <dwChunkStart:DWORD>
        <dwBlockStart:DWORD>
        <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
        <dwLength:DWORD>
        <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
        <dwSampleLength:DWORD>
        <dwPurpose:DWORD>
        <wCountry:WORD>
        <wLanguage:WORD>
        <wDialect:WORD>
        <wCodePage:WORD>
        <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
        <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
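As an illustration of how the typed nonterminals of Figure 5 map onto reads of fixed width, the following Java sketch checks a single fmt chunk against the fmt-ck and common-fields productions. It is a minimal sketch under stated assumptions rather than the parser described in this paper: the class name FmtChunkSketch and the choice of returning the fields as an array are hypothetical, and the code relies on the fact that RIFF stores WORD and DWORD values in little-endian order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical check of the <fmt-ck> production of Figure 5 for one chunk held in memory. */
final class FmtChunkSketch {

    /** Returns {wChannels, dwSamplesPerSec, dwAvgBytesPerSec, wBlockAlign, wBitsPerSample}. */
    static long[] parseFmt(byte[] bytes) {
        // RIFF integers are little-endian, unlike the big-endian integers of IFF.
        ByteBuffer in = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);

        byte[] id = new byte[4];
        in.get(id);
        if (!"fmt ".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("not a fmt chunk");
        }
        long ckSize = in.getInt() & 0xFFFFFFFFL;             // cksize (DWORD)
        if (ckSize < 16 || ckSize > in.remaining()) {
            throw new IllegalArgumentException("bad fmt chunk size " + ckSize);
        }
        int formatTag = in.getShort() & 0xFFFF;              // <wFormatTag:WORD value=0x0001>
        if (formatTag != 0x0001) {
            throw new IllegalArgumentException("not PCM: format tag " + formatTag);
        }
        long channels = in.getShort() & 0xFFFF;              // <wChannels:WORD>
        long samplesPerSec = in.getInt() & 0xFFFFFFFFL;      // <dwSamplesPerSec:DWORD>
        long avgBytesPerSec = in.getInt() & 0xFFFFFFFFL;     // <dwAvgBytesPerSec:DWORD>
        long blockAlign = in.getShort() & 0xFFFF;            // <wBlockAlign:WORD>
        long bitsPerSample = in.getShort() & 0xFFFF;         // <wBitsPerSample:WORD>
        return new long[] {channels, samplesPerSec, avgBytesPerSec, blockAlign, bitsPerSample};
    }
}
```

The constant value attached to wFormatTag in the grammar becomes an equality test, while the other typed nonterminals become plain reads of two or four bytes.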
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
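The handling of pointers with a pushdown stack can be sketched as follows. The layout parsed here is a toy one invented for illustration (a UINT32 offset, then a UINT16 trailer, with a UINT16 value stored at the offset); it is not one of the formats discussed in this paper, and the class name OffsetStackSketch is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: following a file pointer and resuming via a stack of addresses. */
final class OffsetStackSketch {

    /**
     * Toy layout (an assumption for illustration):
     *   file    -> <offset type=UINT32> <trailer type=UINT16> ...
     *   pointee -> <value type=UINT16>      (located at the given offset)
     * Returns {value, trailer}.
     */
    static int[] parse(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        Deque<Integer> returnAddresses = new ArrayDeque<>();

        int offset = in.getInt();                       // the pointer field
        if (offset < 0 || offset > file.length - 2) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        returnAddresses.push(in.position());            // remember where parsing must resume
        in.position(offset);                            // jump to the pointed-to location
        int value = in.getShort() & 0xFFFF;             // procedure for the pointed-to structure
        in.position(returnAddresses.pop());             // pop the address and resume there
        int trailer = in.getShort() & 0xFFFF;           // next nonterminal after the pointer
        return new int[] {value, trailer};
    }
}
```

With nested pointers the same stack holds several return addresses at once, which is why a stack rather than a single saved position is used.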
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
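The conversion from single-byte character tokens to binary data types amounts to masking, shifting, and combining byte values. The following Java sketch shows such conversion functions in isolation; the class ByteDecodeSketch and its method names are hypothetical and are not part of ANTLR or of the grammars in this paper. Each argument after the first function is assumed to be a byte value in the range 0 to 255.

```java
/** Hypothetical helpers that combine single-byte token values into binary data types. */
final class ByteDecodeSketch {

    /** Value of a one-character token read with a single-byte encoding such as ISO-8859-1. */
    static int u8(char c) {
        return c & 0xFF;
    }

    /** Unsigned 16-bit integer, most significant byte first (IFF order). */
    static int uint16BE(int b1, int b2) {
        return (b1 << 8) | b2;
    }

    /** Signed 16-bit integer, most significant byte first; the cast to short sign-extends. */
    static int int16BE(int b1, int b2) {
        return (short) ((b1 << 8) | b2);
    }

    /**
     * Unsigned 32-bit integer, most significant byte first. The first byte is widened to
     * long before shifting so that values of 2^31 and above do not come out negative.
     */
    static long uint32BE(int b1, int b2, int b3, int b4) {
        return ((long) b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    /** Unsigned 32-bit integer, least significant byte first (RIFF order). */
    static long uint32LE(int b1, int b2, int b3, int b4) {
        return uint32BE(b4, b3, b2, b1);
    }
}
```

One detail worth noting is the order of widening and shifting: in Java, shifting an int holding 0xFF left by 24 bits and only then casting to long yields a negative number, which matters when a 32-bit field is meant to be unsigned.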
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
  language = Java;  // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }
}

// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
  : FORM ckSize ILBM
    {
      // Print out ILBM chunk size
      System.out.println("ID ILBM Size " + $ckSize.value);
    }
    // Semantic predicate that ensures that at least one BMHD is found
    propertyChunks { $propertyChunks.foundBMHD == true }?
    dataChunks*
    body
    { $result = true; }
  ;

// The bmhd, cmap and camg property chunks can occur in any order
// At least one BMHD data chunk must be present
propertyChunks returns [boolean foundBMHD]
  : ( bmhd { $foundBMHD = true; }
    | cmap
    | camg )+
  ;

bmhd
  : BMHD ckSize
    { System.out.println("ID BMHD Size " + $ckSize.value); }
    bitMapHeader
  ;

bitMapHeader
  : width = uWordValue
    height = uWordValue
    xPosition = wordValue
    yPosition = wordValue
    nPanes = uByteValue
    masking = uByteValue
    compression = uByteValue
    reserved1 = uByteValue
    transparentColor = uWordValue
    xAspect = uByteValue
    yAspect = uByteValue
    pageWidth = wordValue
    pageHeight = wordValue
    {
      // Print out values in the bitmap structure
      System.out.println("Width " + $width.value);
      System.out.println("Height " + $height.value);
      System.out.println("X Position " + $xPosition.value);
      System.out.println("Y Position " + $yPosition.value);
      System.out.println("Number of Panes " + $nPanes.value);
      System.out.println("Masking" + $masking.value);
      System.out.println("Compression" + $compression.value);
      System.out.println("Reserved1" + $reserved1.value);
      System.out.println("TransParent Color" +
          $transparentColor.value);
      System.out.println("X Aspest" + $xAspect.value);
      System.out.println("Y Aspect" + $yAspect.value);
      System.out.println("Page Width" + $pageWidth.value);
      System.out.println("Page Height" + $pageHeight.value);
    }
  ;

cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
  : CMAP ckSize
    {
      // Print out chunk size
      $value = $ckSize.value;
      System.out.println("ID CMAP Size " + $ckSize.value);
    }
    ( colorRegister
      {
        // Print out color values
        System.out.println("Color[" + i++ + "] " +
            $colorRegister.redValue +
            " " + $colorRegister.greenValue +
            " " + $colorRegister.blueValue);
      }
    )+
  ;

colorRegister returns [int redValue, int greenValue, int blueValue]
  : red = uByteValue green = uByteValue blue = uByteValue
    {
      $redValue = $red.value;
      $greenValue = $green.value;
      $blueValue = $blue.value;
    }
  ;

camg
  : CAMG ckSize longValue
    {
      System.out.println("ID CAMG Size " + $ckSize.value);
      System.out.println("View Modes 0x" +
          Long.toHexString($longValue.value));
    }
  ;

// The crng and ccrt data chunks
dataChunks
  : crng
  | ccrt
  ;

crng
  : CRNG ckSize
    { System.out.println("ID CRNG Size " + $ckSize.value); }
    cRange
  ;

cRange
  : pad1 = wordValue { $pad1.value == 0 }?
    rate = wordValue
    active = wordValue
    low = uByteValue
    high = uByteValue
  ;

ccrt
  : CCRT ckSize
    { System.out.println("ID CCRT Size " + $ckSize.value); }
    cycleInfo
  ;

cycleInfo
  : direction = wordValue
    start = uByteValue
    end = uByteValue
    seconds = longValue
    microseconds = longValue
    pad = wordValue { $pad.value == 0 }?
  ;

body
  : BODY ckSize
    { System.out.println("ID BODY Size " + $ckSize.value); }
    { $ckSize.value > 0 }? data
    { System.out.println("Number of bytes " + $data.value); }
    { $data.value == $ckSize.value }?
  ;

// defines the data chunk of the body chunk
data returns [long value]
  : (b+=BYTE)+ { $b.size() > 0 }?
    { $value = $b.size(); }
  ;

// The cksize is the value of a long integer
ckSize returns [long value]
  : longValue
    { $value = $longValue.value; }
  ;

// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
  : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
    {
      $value = (long)
          ( getUByteValue($h1) << 24 |
            getUByteValue($h2) << 16 |
            getUByteValue($l1) << 8 |
            getUByteValue($l2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the unsigned word data type
uWordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (int)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (short)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the unsigned byte data type
uByteValue returns [int value]
  : BYTE
    {
      // calls the getUByteValue from a BYTE token
      $value = (int) getUByteValue($BYTE);
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : '\u0000'..'\u00FF' ;  // any single byte, read as one ISO-8859-1 character
Figure 6. An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided with the grammar file described in the previous section, ANTLR generates the
iblmLexer and iblmParser Java programs. The Java application shown in Figure 7 uses these
programs to validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
  static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

  public static void main(String[] args) throws RecognitionException, IOException {
    // Read the file with a single-byte encoding so that each byte becomes one character
    ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
    iblmLexer lexer = new iblmLexer(stream);
    TokenStream tokenStream = new CommonTokenStream(lexer);
    iblmParser parser = new iblmParser(tokenStream);
    System.out.println("FILE " + filePath);
    System.out.println("Successfully parsed = " + parser.iblm());
  }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
  language = Java;  // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }

  // Default ByteOrder to Little Endian
  String byteOrder = "II";
}

// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
  : ckId = fourCC { $ckId.value.equals("RIFF") }?
    ckSize = dWordValue
    formType = fourCC { $formType.value.equals("WAVE") }?
    formatSubChunk
    dataSubChunk
    { $result = true; }
  ;

fourCC returns [String value]
  : BYTE BYTE BYTE BYTE { $value = $text; }
  ;

formatSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
    ckSize = dWordValue
    commonFields
    pcmSpecificField
  ;

commonFields
  : formatTag = wordValue        // Format category
    channels = wordValue         // Number of channels
    samplesPerSec = dWordValue   // Sampling rate
    avgBytesPerSec = dWordValue  // For buffer estimation
    blockAlign = wordValue       // Data block size
  ;

pcmSpecificField
  : bitsPerSample = wordValue    // Sample size
  ;

dataSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("data") }?
    ckSize = dWordValue
    waveData[$ckSize.value]
  ;

// consumes exactly size BYTE terminals
waveData [long size]
  : ( { size > 0 }?=> BYTE { size--; } )+
  ;

// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
  : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (long)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 |
              getUByteValue($b3) << 16 |
              getUByteValue($b4) << 24 );
      } else {
        $value = (long)
            ( getUByteValue($b1) << 24 |
              getUByteValue($b2) << 16 |
              getUByteValue($b3) << 8 |
              getUByteValue($b4) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the unsigned word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (int)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 );
      } else {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Terminal rule used to create the Lexer
BYTE : '\u0000'..'\u00FF' ;  // any single byte, read as one ISO-8859-1 character
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in
2011, and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects, namely
validation of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the
JHOVE project in that what is sought is a technology for generating validators for binary file
formats from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification
of binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with a pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
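A file signature test for the two container families used as examples in this paper can be sketched in a few lines of Java. This is only an illustration of the idea and not the magic file referred to above: the class MagicSketch and the strings it returns are hypothetical. It relies on the fact that both IFF and RIFF files begin with a four-character container ID, followed by a four-byte size, followed by a four-character form type.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical signature test for IFF and RIFF containers, given the first bytes of a file. */
final class MagicSketch {

    /** Returns for example "IFF/ILBM" or "RIFF/WAVE", or "unknown" if no signature matches. */
    static String identify(byte[] head) {
        if (head.length < 12) {
            return "unknown";
        }
        // Bytes 0-3: container ID; bytes 4-7: chunk size (ignored here); bytes 8-11: form type.
        String container = new String(head, 0, 4, StandardCharsets.ISO_8859_1);
        String formType = new String(head, 8, 4, StandardCharsets.ISO_8859_1);
        if (container.equals("FORM")) {
            return "IFF/" + formType;
        }
        if (container.equals("RIFF")) {
            return "RIFF/" + formType;
        }
        return "unknown";
    }
}
```

The result of such a test selects which grammar, and hence which generated recognizer, is applied to the file.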
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+arts&button=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853&id=12272454&siteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=ani&button=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology "Appendix H LXO file format" Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOffice.org's Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward and S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format is discussed in Section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$
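The sample-count formula above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `body_sample_count` and the header values passed to it are assumptions made for the example, not part of the 8SVX specification.

```python
def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """Total number of samples in an 8SVX BODY holding ct_octave octaves."""
    highest_octave = one_shot_hi + repeat_hi
    # Octave k (k = 0 is the highest-frequency octave) holds 2**k times as many samples.
    return sum(2 ** k for k in range(ct_octave)) * highest_octave


# Hypothetical header values, chosen only for illustration.
print(body_sample_count(ct_octave=3, one_shot_hi=100, repeat_hi=28))  # (1 + 2 + 4) * 128
```

A validator could compare this count with the size of the BODY chunk when sCompression indicates uncompressed data.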
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
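As a rough illustration of how the <headerchunk> production can be recognized, the sketch below reads the "MThd" tag, checks that the header length is 6, and returns the three 16-bit integers. The function name `parse_midi_header` and the sample bytes are assumptions for the example; integers in Standard MIDI Files are stored big-endian.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Recognize <headerchunk> and return its three 16-bit fields."""
    if len(data) < 14:
        raise ValueError("file too short for an MThd chunk")
    tag, length = struct.unpack_from(">4sI", data, 0)
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError(f"unexpected header length {length}")
    fmt, ntrks, division = struct.unpack_from(">HHH", data, 8)
    return {"format": fmt, "ntrks": ntrks, "division": division}


# Hypothetical header: format 1, two tracks, 480 ticks per quarter note.
sample = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
print(parse_midi_header(sample))
```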
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. Apple [1991] also specified a compressed version, AIFC, of the Audio
Interchange File Format. The following recasts the AIFF specification as a context-free
pseudo-EBNF grammar.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these
chunks is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format is discussed in Section 4.2 of this report.
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP
typically achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of
image quality. A WebP file consists of VP8 image data and a container based on RIFF. The format
of a WebP file is shown below [Google 2010].
Offset   Description
0        "RIFF" 4-byte tag
4        size of the data chunk starting at offset 8
8        "WEBP" the form-type signature
12       "VP8 " 4-byte tag describing the raw video format used
16       size of the raw VP8 image data chunk starting at offset 20
20       the VP8 image data
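The fixed layout in the table lends itself to a direct check. The following is a minimal sketch, assuming the simple lossy layout shown above and little-endian sizes as in other RIFF files; the function name `check_webp_header` and the sample bytes are made up for the illustration.

```python
import struct


def check_webp_header(data: bytes) -> int:
    """Validate the fixed WebP layout and return the VP8 payload size."""
    if len(data) < 20:
        raise ValueError("too short for a WebP header")
    riff, riff_size, form = struct.unpack_from("<4sI4s", data, 0)
    chunk_tag, vp8_size = struct.unpack_from("<4sI", data, 12)
    if riff != b"RIFF" or form != b"WEBP" or chunk_tag != b"VP8 ":
        raise ValueError("not a simple lossy WebP file")
    # The RIFF size covers "WEBP" (4) + "VP8 " (4) + size field (4) + payload.
    if riff_size < 12 + vp8_size:
        raise ValueError("RIFF size too small for the VP8 chunk")
    return vp8_size


# Hypothetical 10-byte payload standing in for real VP8 data.
payload = bytes(10)
sample = (b"RIFF" + struct.pack("<I", 12 + len(payload)) + b"WEBP"
          + b"VP8 " + struct.pack("<I", len(payload)) + payload)
print(check_webp_header(sample))
```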
A5 JPEG
A51 JPEGJFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
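The segment layout above can be walked with a short loop. The sketch below is an illustration only: the name `list_segments` is an assumption, and the walk stops at the start-of-scan marker ($da) because entropy-coded data, which is not described by the segment header, follows it.

```python
import struct


def list_segments(data: bytes) -> list:
    """Return (segment type, payload length) pairs up to SOS or EOI."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    segments, pos = [], 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xD9:                      # EOI carries no size field
            break
        (size,) = struct.unpack_from(">H", data, pos + 2)   # high byte first
        if size < 2:
            raise ValueError("segment size must include its own two bytes")
        segments.append((marker, size - 2))
        if marker == 0xDA:                      # SOS: scan data follows
            break
        pos += 2 + size                         # skip $ff, type, and the segment
    return segments


# Hypothetical file: SOI, one APP0 segment with a 14-byte payload, EOI.
sample = (b"\xff\xd8" + b"\xff\xe0" + struct.pack(">H", 16)
          + b"JFIF\x00" + bytes(9) + b"\xff\xd9")
print(list_segments(sample))
```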
A52 JPEGExif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entries -> entry
entry -> red green blue
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format - 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro fLC
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 InteractiveFiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SNM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. Apple Core Audio Format (CAF), Apple's replacement
format for AIFF, remains a chunk-based format [Apple 2005]. The Resource Interchange File
Format (RIFF), introduced by IBM and Microsoft [1991], is also based on Electronic Arts'
Interchange File Format. Microsoft container formats such as Audio-Video Interleave (AVI) and
Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format recently introduced by
Google [2010], also uses RIFF as a container.
Microsoft's Advanced Systems Format (ASF) [2004] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV). Microsoft's Binary Interchange File Format [Rentz 2008]
(Microsoft Excel's file format) is chunk-based.
Another family of binary file formats is directory-based. A directory-based file format consists
of one or more directory tables, each of which contains one or more directory entries. A
directory entry specifies where the actual data for a type of information is located. This is
the scheme used in TIFF files, OLE (Microsoft Object Linking and Embedding) files, OASIS
OpenDocument files, and Microsoft Office Open XML files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar G is a quadruple $\langle N, \Sigma, S, P \rangle$ where
N is a finite set of non-terminal symbols
$\Sigma$ is a finite set of terminal symbols
$S \in N$ is the start symbol
P is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^*$
The set of all strings over a vocabulary $\Sigma$ is denoted by $\Sigma^*$. The language
generated by a context-free grammar G is the set
$L(G) = \{ w \mid w \in \Sigma^* \text{ and } S \Rightarrow^* w \}$.
31 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables with indices 0 through 9 may be
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i
has the address $4 \times i$.
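The address computation in this example can be sketched in a few lines. The helper name `element_at`, the big-endian byte order, and the stored values are assumptions made for the illustration.

```python
import struct

WORD_SIZE = 4  # bytes per 32-bit integer


def element_at(buf: bytes, i: int) -> int:
    """Read the i-th 32-bit integer; its address is WORD_SIZE * i."""
    return struct.unpack_from(">i", buf, WORD_SIZE * i)[0]


# Ten hypothetical integers stored at addresses 0, 4, 8, ..., 36.
buf = struct.pack(">10i", *range(100, 110))
print(element_at(buf, 9))  # read from address 36
```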
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) are one-dimensional arrays of data types whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, ptr. We define a binary file to be an
array of values of data of various types.
We define a context-free binary file grammar BG as a quintuple
$\langle N, D, \Sigma, S, P \rangle$ where
N is a finite set of non-terminal symbols
D is a set of data types
$\Sigma$ is a finite set of binary values of data types D, called terminals
$S \in N$ is the start symbol
P is a set of production rules of the form $N \rightarrow (N \cup \Sigma)^*$
Let DataTypes indicate the union of all the values of all data types D. The set of all binary
files is $BinaryFiles = [Indices \rightarrow DataTypes]$, where $Indices = \{1, \dots, n\}$ for a
file of $n$ elements. The language generated by a context-free array (binary file) grammar BG is
the set $L(BG) = \{ w \mid w \in BinaryFiles \text{ and } S \Rightarrow^* w \}$.
By primitive nonterminal is meant a symbol that would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions that have only terminal symbols on
the right. We adopt the notation shown below for primitive nonterminals:
<primitive_nonterminal_symbol type=datatype_name constant=value>
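One way to read this notation operationally is as a single typed read with an optional constant check. The sketch below assumes big-endian storage; the table `STRUCT_CODES` and the function `match_primitive` are hypothetical names introduced for the illustration.

```python
import struct

# Hypothetical mapping from data type names to big-endian struct codes.
STRUCT_CODES = {"BYTE": ">B", "INT16": ">h", "UINT16": ">H",
                "INT32": ">i", "UINT32": ">I"}


def match_primitive(buf: bytes, pos: int, type_name: str, constant=None):
    """Read one value of type_name at pos; enforce constant if one is given.

    Returns (value, position of the next unread byte).
    """
    code = STRUCT_CODES[type_name]
    (value,) = struct.unpack_from(code, buf, pos)
    if constant is not None and value != constant:
        raise ValueError(f"expected {constant}, found {value} at offset {pos}")
    return value, pos + struct.calcsize(code)


data = b"\x00\x01\x00\x00\x00\x10"
print(match_primitive(data, 0, "UINT16", constant=1))
print(match_primitive(data, 2, "UINT32"))
```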
32 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1968, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar GA is a triple $\langle G, A, AR \rangle$, where G is a context-free grammar
for the language, A associates each grammar symbol $X \in (N \cup \Sigma)$ with a set of
attributes, and AR associates each production $R \in P$ with a set of attribute computation
rules. $A(X)$, where $X \in (N \cup \Sigma)$, can be further partitioned into two sets:
synthesized attributes $S(X)$ and inherited attributes $I(X)$. $AR(R)$, where $R \in P$, contains
rules for computing inherited and synthesized attributes associated with the symbols in the
production R. Synthesized attributes are evaluated and passed up from deeper terminals and
non-terminals, while inherited attributes are passed down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes and
ii) inherited attributes, where each inherited attribute of a child node depends only on
a) attributes of the other child nodes to its left
b) inherited attributes of the parent node itself
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
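The left-to-right evaluation order can be illustrated with a generic chunk rule, <chunk> -> tag <cksize> <data>, in which the value of cksize is synthesized from the bytes just read and then handed to the sibling on its right as an inherited attribute. The rule, the function names, and the big-endian layout are assumptions chosen for this sketch.

```python
import struct


def parse_data(buf: bytes, pos: int, size: int):
    """<data>: size is an inherited attribute supplied by the enclosing rule."""
    if pos + size > len(buf):
        raise ValueError("chunk data runs past the end of the file")
    return buf[pos:pos + size], pos + size


def parse_chunk(buf: bytes, pos: int = 0):
    """<chunk> -> tag <cksize> <data>, evaluated strictly left to right."""
    tag, cksize = struct.unpack_from(">4sI", buf, pos)   # cksize.value is synthesized
    data, end = parse_data(buf, pos + 8, cksize)         # and passed to the right sibling
    return {"tag": tag, "cksize": cksize, "data": data}, end


chunk = b"BODY" + struct.pack(">I", 3) + b"abc"
print(parse_chunk(chunk))
```

Because every attribute needed by a procedure is available before the procedure is called, no second pass over the parse tree is required.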
33 Semantic Grammar
An interpreter for a binary file format that uses a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialog systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993, Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
SHOW          Show me
FLIGHTS       flights
ORIGIN        from Boston
DESTINATION   to San Francisco
DEPART_DATE   on Tuesday
DEPART_TIME   morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats, but just to a single
format or to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the
right side) and inherited rules (attribute values of symbols on the right side of a production
are calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
41 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986a]. In this section, the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a mixture of synthesized semantic rules
(attribute values of the left side of a production are calculated from attribute values of
symbols on the right side) and inherited rules (attribute values of symbols on the right side of
a production are calculated from attribute values of the symbol on the left side). The attribute
grammar is also an L-attributed grammar. Figure 4 shows a parse tree for the file hampic.
Figure 2 ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
    ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitmapHeader>
<BitmapHeader> → <width type=UINT16>
    <height type=UINT16>
    <xposition type=INT16>
    <yposition type=INT16>
    <nplanes type=BYTE>
    <masking type=BYTE>
    <compression type=BYTE>
    <reserved type=BYTE>
    <transparentcolor type=UINT16>
    <xaspect type=BYTE>
    <yaspect type=BYTE>
    <pagewidth type=INT16>
    <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]    n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
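To suggest how the <BMHD> production might be checked in practice, the sketch below reads the chunk tag and size and then the thirteen header fields in the order given in the grammar. It assumes the 20-byte big-endian layout of the ILBM specification, with 16-bit width and height; the names `BitmapHeader`, `BMHD_FORMAT`, and `parse_bmhd` and the sample values are hypothetical.

```python
import struct
from collections import namedtuple

BitmapHeader = namedtuple(
    "BitmapHeader",
    "width height xposition yposition nplanes masking compression reserved "
    "transparentcolor xaspect yaspect pagewidth pageheight",
)
BMHD_FORMAT = ">HHhhBBBBHBBhh"  # big-endian; 20 bytes in total


def parse_bmhd(buf: bytes, pos: int = 0) -> BitmapHeader:
    """Recognize <BMHD> and return the fields of <BitmapHeader>."""
    tag, cksize = struct.unpack_from(">4sI", buf, pos)
    if tag != b"BMHD":
        raise ValueError("expected BMHD chunk")
    if cksize != struct.calcsize(BMHD_FORMAT):
        raise ValueError(f"unexpected BMHD size {cksize}")
    return BitmapHeader(*struct.unpack_from(BMHD_FORMAT, buf, pos + 8))


# Hypothetical header for a 320 x 200 image with 6 bit planes.
sample = struct.pack(">4sI", b"BMHD", 20) + struct.pack(
    BMHD_FORMAT, 320, 200, 0, 0, 6, 0, 1, 0, 0, 10, 11, 320, 200)
print(parse_bmhd(sample))
```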
Figure 4 A Parse Tree for the ILBM File hampic
42 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
    <fmt-ck>
    [<fact-ck>]
    [<cue-ck>]
    [<playlist-ck>]
    [<assoc-data-list>]
    <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
    <wChannels:WORD>
    <dwSamplesPerSec:DWORD>
    <dwAvgBytesPerSec:DWORD>
    <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' ( <data-ck> | <silence-ck> )+
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
    <dwPosition:DWORD>
    <fccChunk:FOURCC>
    <dwChunkStart:DWORD>
    <dwBlockStart:DWORD>
    <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
    <dwLength:DWORD>
    <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
    <dwSampleLength:DWORD>
    <dwPurpose:DWORD>
    <wCountry:WORD>
    <wLanguage:WORD>
    <wDialect:WORD>
    <wCodePage:WORD>
    <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
    <fileData:BYTE>+
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
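As an illustration of how the data types attached to the primitive non-terminals drive the reading of a chunk, the following sketch recognizes <fmt-ck> for PCM data. It assumes little-endian integers, as in RIFF files generally; the function name `parse_fmt_ck` and the sample values are made up for the example.

```python
import struct


def parse_fmt_ck(buf: bytes, pos: int = 0) -> dict:
    """Recognize <fmt-ck> for PCM data and return its fields."""
    tag, cksize = struct.unpack_from("<4sI", buf, pos)
    if tag != b"fmt ":
        raise ValueError("expected 'fmt ' chunk")
    if cksize < 16:
        raise ValueError("fmt chunk too small for the PCM fields")
    names = ("wFormatTag", "wChannels", "dwSamplesPerSec",
             "dwAvgBytesPerSec", "wBlockAlign", "wBitsPerSample")
    # WORD is a 16-bit and DWORD a 32-bit unsigned integer.
    fields = dict(zip(names, struct.unpack_from("<HHIIHH", buf, pos + 8)))
    if fields["wFormatTag"] != 0x0001:      # constant value from the grammar
        raise ValueError("wFormatTag is not 0x0001 (PCM)")
    return fields


# Hypothetical 16-bit stereo stream at 44100 samples per second.
sample = struct.pack("<4sIHHIIHH", b"fmt ", 16, 1, 2, 44100, 176400, 4, 16)
print(parse_fmt_ck(sample))
```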
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that
occur in the file formats are handled by pushing the current address in the file onto a pushdown
stack. When the parser exhausts the procedures implementing production rules corresponding to
the pointed-to location, the topmost address on the pushdown stack is popped, and the procedure
corresponding to the nonterminal at that position is executed.
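The pointer-handling scheme just described might look as follows for a toy directory format, invented here purely for illustration: each directory entry is a 32-bit offset pointing to a record made of a length byte followed by that many ASCII characters. All names and the layout are assumptions, not a format treated in this report.

```python
import struct


def parse_record(buf: bytes, pos: int) -> str:
    """<record> -> <length type=BYTE> <text type=CHAR>[length]"""
    length = buf[pos]
    return buf[pos + 1:pos + 1 + length].decode("ascii")


def parse_entry(buf: bytes, pos: int, stack: list):
    """<entry> -> <offset type=UINT32>; follow the pointer, then resume."""
    (offset,) = struct.unpack_from(">I", buf, pos)
    stack.append(pos + 4)               # push the address following the pointer
    value = parse_record(buf, offset)   # parse the pointed-to location
    resume = stack.pop()                # pop and continue from the saved address
    return value, resume


def parse_directory(buf: bytes, count: int) -> list:
    """<directory> -> <entry>[count]"""
    stack, pos, values = [], 0, []
    for _ in range(count):
        value, pos = parse_entry(buf, pos, stack)
        values.append(value)
    return values


# Two entries pointing at records stored at offsets 8 and 11.
buf = struct.pack(">II", 8, 11) + b"\x02hi" + b"\x03bye"
print(parse_directory(buf, 2))
```

An explicit stack is used here to mirror the description; in a recursive implementation the call stack could hold the saved address instead.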
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that
reads a file and generates an error if the file does not conform to the syntax specified by the
grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a
particular binary file format and whose generated output is the Java source code of a recursive
descent parser (with a pushdown stack, if needed) for the class of binary file formats specified
by the grammar. Such a parser generator does not yet exist, but there is a widely used parser
generator for attribute grammars for string-based languages.
61 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995, Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This approach is effective, if somewhat clumsy, and has allowed us to
test our attribute grammars for binary file formats as well as to demonstrate the feasibility of
creating a parser generator for binary file grammars.
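The conversion functions mentioned above might look like the following Java sketch. The class ByteDecoder and its method names are hypothetical and are not taken from the grammars in this paper. Each argument is assumed to be an unsigned byte value in the range 0 to 255, such as the value obtained from a one-character token. The most significant byte is widened to long before shifting so that 32-bit values of 2^31 or more stay positive.

```java
// Hypothetical helpers that assemble integer data types from unsigned byte values.
final class ByteDecoder {
    // Unsigned byte from a one-character token text.
    static int u8(char c) {
        return c & 0xFF;
    }

    // Unsigned 16-bit integer, most significant byte first.
    static int u16be(int b1, int b2) {
        return (b1 << 8) | b2;
    }

    // Signed 16-bit integer, most significant byte first.
    static int s16be(int b1, int b2) {
        return (short) ((b1 << 8) | b2);
    }

    // Unsigned 32-bit integer held in a long, most significant byte first.
    static long u32be(int b1, int b2, int b3, int b4) {
        return ((long) b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    // Unsigned 32-bit integer held in a long, least significant byte first.
    static long u32le(int b1, int b2, int b3, int b4) {
        return u32be(b4, b3, b2, b1);
    }
}
```

For example, the big-endian bytes 00 00 09 FC decode to 2556 with u32be.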
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}

// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      {
        // Print out ILBM chunk size
        System.out.println("ID ILBM Size " + $ckSize.value);
      }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks*
      body
      { $result = true; }
    ;

// The bmhd, cmap, and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg
      )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      {
        // Print out values in the bitmap structure
        System.out.println(" Width " + $width.value);
        System.out.println(" Height " + $height.value);
        System.out.println(" X Position " + $xPosition.value);
        System.out.println(" Y Position " + $yPosition.value);
        System.out.println(" Number of Panes " + $nPanes.value);
        System.out.println(" Masking " + $masking.value);
        System.out.println(" Compression " + $compression.value);
        System.out.println(" Reserved1 " + $reserved1.value);
        System.out.println(" TransParent Color " +
            $transparentColor.value);
        System.out.println(" X Aspest " + $xAspect.value);
        System.out.println(" Y Aspect " + $yAspect.value);
        System.out.println(" Page Width " + $pageWidth.value);
        System.out.println(" Page Height " + $pageHeight.value);
      }
    ;

cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
    : CMAP ckSize
      {
        // Print out chunk size
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        {
          // Print out color values
          System.out.println(" Color[" + i++ + "] " +
              $colorRegister.redValue +
              " " + $colorRegister.greenValue +
              " " + $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println(" View Modes 0x" +
            Long.toHexString($longValue.value));
      }
    ;

// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// Defines the data of the body chunk
data returns [long value]
    : (b+=BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;

// The ckSize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        $value = (long)
            ( getUByteValue($h1) << 24 |
              getUByteValue($h2) << 16 |
              getUByteValue($l1) << 8 |
              getUByteValue($l2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      {
        // Calls getUByteValue on the BYTE token
        $value = (int) getUByteValue($BYTE);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;

import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
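The application reads the file with the ISO-8859-1 encoding. A plausible reason, offered here as an interpretation rather than something the paper states, is that ISO-8859-1 maps every byte value 0 to 255 to the character with the same code, so a one-character token carries exactly one byte. The following Java sketch checks that property and contrasts it with UTF-8. The class name Latin1RoundTrip is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check that ISO-8859-1 decoding preserves every byte value.
final class Latin1RoundTrip {
    // True if byte value i decodes to the character with code i, for all i in 0..255.
    static boolean latin1PreservesAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        if (s.length() != 256) return false;
        for (int i = 0; i < 256; i++) {
            if (s.charAt(i) != i) return false;
        }
        return true;
    }

    // A lone 0xE9 byte is not valid UTF-8, so it decodes to the replacement character.
    static char utf8DecodeOfLoneE9() {
        return new String(new byte[] {(byte) 0xE9}, StandardCharsets.UTF_8).charAt(0);
    }
}
```

With a multi-byte encoding such as UTF-8, byte values of 128 or more would not survive decoding, and the masking in getUByteValue could not recover them.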
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking 0
Compression 1
Reserved1 0
TransParent Color 0
X Aspest 3
Y Aspect 3
Page Width 200
Page Height 144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the File BLK.IFF
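The sizes reported in Figure 8 are mutually consistent, which can be checked by hand: the FORM size should equal 4 bytes for the form type ID plus, for each chunk, an 8-byte header and the chunk data. That gives 4 + (8 + 20) + (8 + 48) + (8 + 4) + (8 + 2448) = 2556. The Java sketch below performs the same arithmetic. The class and method names are hypothetical, and the pad byte after odd-sized chunks is assumed from EA IFF 85.

```java
// Hypothetical consistency check on the chunk sizes printed by the recognizer.
final class IlbmSizeCheck {
    // Expected FORM ckSize given the ckSize of each contained chunk.
    static long formSize(long... chunkSizes) {
        long total = 4;                  // form type ID, e.g. "ILBM"
        for (long s : chunkSizes) {
            total += 8 + s + (s & 1);    // chunk ID + ckSize field + data + optional pad
        }
        return total;
    }

    // Number of color registers in a CMAP chunk: three bytes (red, green, blue) each.
    static long colorCount(long cmapSize) {
        return cmapSize / 3;
    }
}
```

The same reasoning accounts for the other values: the 48-byte CMAP holds the 16 color registers listed, and the BMHD fields (five unsigned or signed words before and after six single-byte fields, as declared in bitMapHeader) total 20 bytes.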
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default byte order is little-endian
    String byteOrder = "II";
}

// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue          // Format category
      channels = wordValue           // Number of channels
      samplesPerSec = dWordValue     // Sampling rate
      avgBytesPerSec = dWordValue    // For buffer estimation
      blockAlign = wordValue         // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue      // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ( { size >= 0 }?=> BYTE { size--; } )+
    ;

// Defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        else
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        else
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer
BYTE : . ;
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
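The grammar in Figure 9 reads the format fields but does not relate them to one another. For PCM data, the RIFF specification [IBM & Microsoft 1991] defines the format category 1 as PCM, the block alignment as the number of channels times the bytes per sample, and the average bytes per second as the sampling rate times the block alignment. The Java sketch below shows how such checks might be expressed outside the grammar. The class PcmFormatCheck is hypothetical, and the sketch assumes it is given the 16-byte body of a little-endian "fmt " chunk.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical semantic check on the fields of a PCM "fmt " chunk body.
final class PcmFormatCheck {
    static boolean consistent(byte[] fmtBody) {
        if (fmtBody.length < 16) return false;
        ByteBuffer b = ByteBuffer.wrap(fmtBody).order(ByteOrder.LITTLE_ENDIAN);
        int formatTag = b.getShort() & 0xFFFF;         // format category
        int channels = b.getShort() & 0xFFFF;          // number of channels
        long samplesPerSec = b.getInt() & 0xFFFFFFFFL; // sampling rate
        long avgBytesPerSec = b.getInt() & 0xFFFFFFFFL;
        int blockAlign = b.getShort() & 0xFFFF;        // data block size
        int bitsPerSample = b.getShort() & 0xFFFF;     // sample size
        int bytesPerSample = (bitsPerSample + 7) / 8;  // round up to whole bytes
        return formatTag == 1
            && blockAlign == channels * bytesPerSample
            && avgBytesPerSec == samplesPerSec * blockAlign;
    }
}
```

In an attribute grammar these relations would be written as semantic predicates on the commonFields and pcmSpecificField rules.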
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
JHOVE project for creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with a pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
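To illustrate what a file signature test involves for chunk-based formats, the Java sketch below inspects the container ID in bytes 0 to 3 and the form type in bytes 8 to 11 of a file. It is a simplified stand-in written for this illustration, not the magic file of [Underwood 2009]. The class name ChunkMagic and the returned labels are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical signature test for two of the chunk-based formats discussed in this paper.
final class ChunkMagic {
    // head must hold at least the first 12 bytes of the file.
    static String identify(byte[] head) {
        if (head.length < 12) return "unknown";
        String container = new String(head, 0, 4, StandardCharsets.ISO_8859_1);
        String formType = new String(head, 8, 4, StandardCharsets.ISO_8859_1);
        if (container.equals("FORM") && formType.equals("ILBM")) return "IFF ILBM";
        if (container.equals("RIFF") && formType.equals("WAVE")) return "RIFF WAVE";
        return "unknown";
    }
}
```

Identification of this kind selects which grammar to apply. The recognizer generated from that grammar then performs the actual validation.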
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A. V. Aho. Indexed Grammars - An Extension of Context-free Grammars. Journal
of the ACM, Vol. 15, Issue 4, Oct. 1968.
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G. Back (2002). A specification and scripting language for binary data. Generative
Programming and Component Engineering, Lecture Notes in Computer Science, Vol. 2487,
66-77.
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D. Baum & G. Noel. The Simms Interchange File Format. 2002.
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library. JHOVE2 User's Guide. 2011.
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari. Caligari trueSpace file format. Document Version 6, July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley, M., Rankin, S., Conway, E., & Giaretta, D. (2007). The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher, K., & Gruber, R. (2005). PADS: A domain-specific language for
processing ad hoc data. ACM Conference on Programming Language Design and
Implementation. ACM Press. httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft. Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S. Issar and W. Ward. CMU's robust spoken language understanding
system. EUROSPEECH-93, pp. 2147-2150. wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D. E. Knuth. Semantics of Context-Free Grammars. Mathematical Systems Theory,
vol. 2, pp. 127-145, 1968.
[Knuth 1971] D. E. Knuth. Semantics of Context-Free Grammars (corrections). Mathematical
Systems Theory, vol. 5, pp. 95-96, 1971.
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave. Lightwave 3D Object Files (LWO2). November 9, 2001.
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology. "Appendix H: LXO file format," Extensions, modo 202 Scripting and
Commands. 2006. wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein. PIC(PRO) Format.
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison. "EA IFF 85" Standard for Interchange Format Files.
Electronic Arts, January 14, 1985. wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J. Morrison. "SMUS" IFF Simple Musical Score. Electronic Arts, Inc.,
February 5, 1986. httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr, T. J., & Quong, R. W. (1995). ANTLR: A predicated-LL(k) parser
generator. Software: Practice and Experience, 25(7), 789-810.
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr, T. (2007). The Definitive ANTLR Reference: Building Domain-Specific
Languages. Pragmatic Bookshelf.
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer. Class RIFFParser. January 6, 2011.
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki. IFF. httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995 (chapter 10, Attribute Grammars).
[Xiph 2010] Xiph.Org Foundation. Vorbis I specification. February 3, 2010.
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H. Zappe. Oktalyzer. 1994.
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A: Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
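The sample count above can be checked with a few lines of Java. This is only a sketch: the class name Svx8BodySize and the accepted range of ctOctave (1 to 31) are assumptions made for illustration, not part of the 8SVX specification.

```java
/** Sketch: number of samples in an 8SVX BODY chunk (hypothetical helper class). */
final class Svx8BodySize {

    /**
     * Sums the octave sizes 2^0 ... 2^(ctOctave-1), each multiplied by the
     * size of the highest octave (oneShotHiSamples + repeatHiSamples).
     */
    static long bodySampleCount(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        // Assumed sanity range; the specification stores ctOctave in a UBYTE.
        if (ctOctave < 1 || ctOctave > 31) {
            throw new IllegalArgumentException("ctOctave out of range: " + ctOctave);
        }
        long highestOctave = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        for (int octave = 0; octave < ctOctave; octave++) {
            total += highestOctave << octave;   // octave k holds 2^k times as many samples
        }
        return total;
    }
}
```

For example, with ctOctave = 3, oneShotHiSamples = 100 and repeatHiSamples = 50, the call Svx8BodySize.bodySampleCount(3, 100, 50) sums 150 + 300 + 600 samples. A validator could compare this value with the cksize of the BODY chunk when sCompression indicates uncompressed data.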
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> | <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
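The <headerchunk> rule can be read as a small recognizer. The following Java sketch parses only the header chunk and checks that the length of header data is 6; the class name MidiHeader and its field names are hypothetical, and the big-endian byte order is that of Standard MIDI Files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of a recognizer for <headerchunk>; names are illustrative only. */
final class MidiHeader {
    final int format;
    final int ntrks;
    final int division;

    private MidiHeader(int format, int ntrks, int division) {
        this.format = format;
        this.ntrks = ntrks;
        this.division = division;
    }

    static MidiHeader parse(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("file too short for a header chunk");
        }
        ByteBuffer in = ByteBuffer.wrap(file);          // ByteBuffer is big-endian by default
        byte[] id = new byte[4];
        in.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MThd chunk identifier");
        }
        int length = in.getInt();                       // <length of header data>
        if (length != 6) {
            throw new IllegalArgumentException("header data length must be 6, found " + length);
        }
        int format = in.getShort() & 0xFFFF;            // three 16-bit integers
        int ntrks = in.getShort() & 0xFFFF;
        int division = in.getShort() & 0xFFFF;
        return new MidiHeader(format, ntrks, division);
    }
}
```

A full recognizer would continue with one procedure per remaining rule, using ntrks to decide how many <trackchunk> procedures to call.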
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange file.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[]
CommentsChunk -> "COMT" cksize numComments Comments[]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced in 2010 by Google, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
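The offset table translates directly into a header check. The sketch below is illustrative only: the class WebPHeader and the method vp8PayloadSize are hypothetical names, the sizes are read as little-endian unsigned 32-bit integers as in RIFF, and the consistency test between the two size fields is an assumption about what a validator might enforce.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch: validates the fixed part of the WebP layout shown in the offset table. */
final class WebPHeader {

    /** Returns the size of the VP8 image data, or throws if the header is malformed. */
    static long vp8PayloadSize(byte[] file) {
        if (file.length < 20) {
            throw new IllegalArgumentException("file too short for a WebP header");
        }
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        expectTag(file, 0, "RIFF");
        long riffSize = in.getInt(4) & 0xFFFFFFFFL;     // size of data starting at offset 8
        expectTag(file, 8, "WEBP");
        expectTag(file, 12, "VP8 ");
        long vp8Size = in.getInt(16) & 0xFFFFFFFFL;     // size of data starting at offset 20
        // "WEBP" + "VP8 " + the VP8 size field occupy 12 bytes of the RIFF data.
        if (vp8Size + 12 > riffSize) {
            throw new IllegalArgumentException("VP8 chunk does not fit inside the RIFF chunk");
        }
        return vp8Size;
    }

    private static void expectTag(byte[] file, int offset, String tag) {
        String found = new String(file, offset, 4, StandardCharsets.US_ASCII);
        if (!found.equals(tag)) {
            throw new IllegalArgumentException(
                    "expected \"" + tag + "\" at offset " + offset + ", found \"" + found + "\"");
        }
    }
}
```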
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker, see below
- any number of segments (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes, but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last.
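As an illustration of the segment format, the following Java sketch walks the segments that follow the SOI marker and collects their type bytes. The class and method names are hypothetical. The walk stops at EOI ($d9) or after the start-of-scan segment ($da), because the data that follows a start-of-scan header is entropy-coded and is not organized into length-prefixed segments.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: lists the segment types of a JPEG/JFIF file up to SOS or EOI. */
final class JpegSegments {

    static List<Integer> segmentTypes(byte[] file) {
        if (file.length < 2 || (file[0] & 0xFF) != 0xFF || (file[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header ($ff $d8)");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= file.length) {
            if ((file[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = file[pos + 1] & 0xFF;
            if (type == 0xD9) {
                break;                                   // EOI trailer
            }
            // sh, sl: high byte first, low byte last
            int size = ((file[pos + 2] & 0xFF) << 8) | (file[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("segment size below 2 at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                break;                                   // start of scan: stop walking
            }
            pos += 2 + size;                             // $ff + type byte + size bytes
        }
        return types;
    }
}
```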
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
stored as 10 (4-byte) words at file addresses 0, 4, 8, ..., 36, so that the element with index i has the
address 4 × i.
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and the external memory of storage devices (and files on
those devices) are one-dimensional arrays of data types whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, and ptr. We define a binary file to be an array
of values of data of various types.
We define a context-free binary file grammar BG as a quintuple <N, D, Σ, S, P>, where
N is a finite set of non-terminal symbols,
D is a set of data types,
Σ is a finite set of binary values of data types D, called terminals,
S ∈ N is the start symbol, and
P is a set of production rules of the form N → (N ∪ Σ)*.
Let DataTypes indicate the union of all the values of all data types D. The set of all binary files is
BinaryFiles = [Indices → DataTypes], where Indices = {1, ..., n}. The language generated by a
context-free array (binary file) grammar AG is the set
$L(AG) = \{\, w \mid w \in \mathit{BinaryFiles} \text{ and } S \Rightarrow^{*} w \,\}$.
By primitive nonterminal is meant a symbol which would be replaced by one or more terminal
symbols were an additional production rule applied. In other words, a primitive nonterminal is a
nonterminal that appears on the left-hand side of productions which only have one or more
terminal symbols on the right. We adopt the notation shown below for primitive nonterminals:
<primitive_nonterminal_symbol type=datatype_name constant=value>
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages,
such as (1) enforcing the constraint that all variables are declared before they are used, or (2)
checking the number of parameters in a function call against the number in the function's
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject-verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1968, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar GA is a triple <G, A, AR>, where G is a context-free grammar for the
language, A associates each grammar symbol X ∈ (N ∪ Σ) with a set of attributes, and AR
associates each production R ∈ P with a set of attribute computation rules. A(X), where X ∈ (N ∪
Σ), can be further partitioned into two sets: synthesized attributes S(X) and inherited attributes
I(X). AR(R), where R ∈ P, contains rules for computing inherited and synthesized attributes
associated with the symbols in the production R. Synthesized attributes are evaluated and passed
up from deeper terminals and non-terminals, while inherited attributes are passed down from
upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes, and
ii) inherited attributes, where each inherited attribute of a symbol on the right-hand side of a
production depends only on
a) attributes of the sibling nodes to its left, and
b) inherited attributes of the parent node (the left-hand side of the production).
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
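To make the left-to-right flow of attributes concrete, the following Java sketch evaluates an L-attributed rule during top-down parsing, using the ILBM CMAP chunk as the example: the chunk size is synthesized from the size field, and the number of color entries, n = cksize / 3, is passed down as an inherited attribute to the procedure that parses the entries. The class and method names are hypothetical, and the sketch ignores IFF pad bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: L-attributed evaluation for CMAP -> "CMAP" cksize color[n], n = cksize / 3. */
final class CmapChunkParser {

    /** Procedure for the CMAP production; returns n rows of {red, green, blue}. */
    static int[][] parseCmap(ByteBuffer in) {
        if (in.remaining() < 8) {
            throw new IllegalStateException("chunk header truncated");
        }
        byte[] id = new byte[4];
        in.get(id);
        if (!"CMAP".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalStateException("expected chunk identifier CMAP");
        }
        // Synthesized attribute: the value of the UINT32 terminal (big-endian in IFF).
        long cksize = in.getInt() & 0xFFFFFFFFL;
        if (cksize % 3 != 0) {
            throw new IllegalStateException("CMAP size is not a multiple of 3: " + cksize);
        }
        // Inherited attribute: n is computed from a symbol to the left and passed down.
        return parseColors(in, (int) (cksize / 3));
    }

    /** Procedure for color[n]; n is the inherited attribute. */
    private static int[][] parseColors(ByteBuffer in, int n) {
        if (in.remaining() < 3L * n) {
            throw new IllegalStateException("chunk data shorter than cksize");
        }
        int[][] colors = new int[n][3];
        for (int i = 0; i < n; i++) {
            colors[i][0] = in.get() & 0xFF;   // red
            colors[i][1] = in.get() & 0xFF;   // green
            colors[i][2] = in.get() & 0xFF;   // blue
        }
        return colors;
    }
}
```

Because n depends only on a symbol to its left, a single left-to-right pass suffices, which is the property that makes L-attributed grammars convenient for recursive descent.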
3.3 Semantic Grammar
An interpreter for a binary file format that uses a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialogue systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993, Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
SHOW     FLIGHTS  ORIGIN       DESTINATION       DEPART_DATE  DEPART_TIME
Show me  flights  from Boston  to San Francisco  on Tuesday   morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats, but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986]. In this section, the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a mixture of synthesized semantic rules
and inherited rules, as defined above, and it is also an L-attributed grammar. Figure 4 shows a
parse tree for the file hampic.
Figure 2 ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
    ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]    n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
Figure 4 A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> "RIFF" cksize "WAVE"
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> "fmt " cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> "data" cksize <wave-data>
<wave-list> -> "LIST" cksize "wavl" <data-ck> | <silence-ck>
<silence-ck> -> "slnt" cksize <dwSamples:DWORD>
<fact-ck> -> "fact" cksize <dwFileSize:DWORD>
<cue-ck> -> "cue " cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> "plst" cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> "LIST" cksize "adtl" <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> "labl" cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> "note" cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> "ltxt" cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> "file" cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur in
the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
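The pointer-handling scheme can be sketched in Java as follows. The file layout used here is invented purely for illustration and does not correspond to any format in this report: a UINT32 pointer at address 0, followed by a UINT16 trailer field, with the pointer addressing a UINT16 record elsewhere in the file. The class and method names are likewise hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of pointer handling with a pushdown stack, for the invented rule
 * file -> pointer(UINT32) trailer(UINT16), where the pointer targets record(UINT16).
 */
final class PointerFollowingParser {
    private final ByteBuffer in;
    private final Deque<Integer> returnAddresses = new ArrayDeque<>();

    PointerFollowingParser(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** Returns {record, trailer}. */
    int[] parseFile() {
        if (in.limit() < 6) {
            throw new IllegalStateException("file too short");
        }
        int target = in.getInt();
        if (target < 0 || target > in.limit() - 2) {
            throw new IllegalStateException("pointer out of range: " + target);
        }
        returnAddresses.push(in.position());    // push the current address
        in.position(target);                    // continue at the pointed-to location
        int record = parseRecord();
        in.position(returnAddresses.pop());     // pop and resume after the pointer field
        int trailer = in.getShort() & 0xFFFF;
        return new int[] {record, trailer};
    }

    /** Procedure for the rule applied at the pointed-to location. */
    private int parseRecord() {
        return in.getShort() & 0xFFFF;
    }
}
```

With nested pointers, each procedure that follows a pointer pushes its own return address, so the stack depth mirrors the nesting of pointed-to structures.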
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995, Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-based
and directory-based binary file formats. This is accomplished by writing a lexer that treats
each input byte in a file as a character token. Then functions are created for each data type in the
binary file grammar that convert the appropriate number of character tokens to binary data types,
e.g., int16, int32, etc. This approach is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
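The byte-as-character technique can be illustrated outside ANTLR. The sketch below decodes a file with the ISO-8859-1 charset, which maps each byte value 0x00-0xFF to the character with the same code point, so every character of the resulting string stands for exactly one input byte. The class ByteTokens and its methods are hypothetical helpers, and big-endian order is assumed, as in IFF files.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: bytes treated as character tokens and converted back to binary data types. */
final class ByteTokens {

    /** One character token per input byte. */
    static String toCharTokens(byte[] file) {
        return new String(file, StandardCharsets.ISO_8859_1);
    }

    /** Unsigned byte value of the token at the given index. */
    static int uByte(String tokens, int index) {
        return tokens.charAt(index) & 0xFF;
    }

    /** Signed 16-bit integer built from two consecutive tokens (big-endian). */
    static int int16(String tokens, int index) {
        return (short) ((uByte(tokens, index) << 8) | uByte(tokens, index + 1));
    }

    /** Unsigned 32-bit integer built from four consecutive tokens (big-endian). */
    static long uint32(String tokens, int index) {
        return ((long) uByte(tokens, index) << 24)
                | ((long) uByte(tokens, index + 1) << 16)
                | ((long) uByte(tokens, index + 2) << 8)
                | (long) uByte(tokens, index + 3);
    }
}
```

Widening each byte to long before shifting keeps sizes of 2^31 or more from turning negative.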
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}

// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      {
        // Print out ILBM chunk size
        System.out.println("ID ILBM Size " + $ckSize.value);
      }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks
      body
      { $result = true; }
    ;

// The bmhd, cmap and camg property chunks can occur in any order
// At least one BMHD data chunk must be present
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg
      )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      {
        // Print out values in the bitmap structure
        System.out.println("Width " + $width.value);
        System.out.println("Height " + $height.value);
        System.out.println("X Position " + $xPosition.value);
        System.out.println("Y Position " + $yPosition.value);
        System.out.println("Number of Panes " + $nPanes.value);
        System.out.println("Masking " + $masking.value);
        System.out.println("Compression " + $compression.value);
        System.out.println("Reserved1 " + $reserved1.value);
        System.out.println("TransParent Color " +
            $transparentColor.value);
        System.out.println("X Aspest " + $xAspect.value);
        System.out.println("Y Aspect " + $yAspect.value);
        System.out.println("Page Width " + $pageWidth.value);
        System.out.println("Page Height " + $pageHeight.value);
      }
    ;

cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
    : CMAP ckSize
      {
        // Print out chunk size
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        {
          // Print out color values
          System.out.println("Color[" + i++ + "] " +
              $colorRegister.redValue +
              " " + $colorRegister.greenValue +
              " " + $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println("View Modes 0x" +
            Long.toHexString($longValue.value));
      }
    ;

// The crng and ccrt data chunks
dataChunks
    : crng?
    | ccrt?
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println("Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// defines the data chunk of the body chunk
data returns [long value]
    : (b+=BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;

// The cksize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        // widen to long before shifting so that a set high bit does not
        // produce a negative value
        $value = ((long) getUByteValue($h1) << 24) |
                 ((long) getUByteValue($h2) << 16) |
                 ((long) getUByteValue($l1) << 8) |
                 (long) getUByteValue($l2);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      {
        // calls getUByteValue on the BYTE token
        $value = (int) getUByteValue($BYTE);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : '\u0000'..'\u00FF' ;   // matches any single 8-bit character
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLKIFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps each byte of the file to exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE: " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
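The application in Figure 7 opens the binary file as an ISO-8859-1 character stream, and getUByteValue recovers each byte by masking the character with 0x00ff. The following minimal sketch shows why this pairing appears to be safe: ISO-8859-1 maps every byte value 0..255 to exactly one character with the same code. The class Latin1Bytes and its methods are hypothetical and only imitate getUByteValue on plain strings rather than on ANTLR tokens.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demonstration of byte recovery from an ISO-8859-1 character stream. */
class Latin1Bytes {
    /** Imitates getUByteValue: first character of the token text, masked to 8 bits. */
    static int getUByteValue(String tokenText) {
        return tokenText.charAt(0) & 0x00FF;
    }

    /** Returns true if every byte value 0..255 survives decoding as ISO-8859-1. */
    static boolean roundTrips() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        if (decoded.length() != 256) {
            return false;
        }
        for (int i = 0; i < 256; i++) {
            if (getUByteValue(decoded.substring(i, i + 1)) != i) {
                return false;
            }
        }
        return true;
    }

    /** Returns true if the byte 0xFF is lost when the same data is decoded as UTF-8. */
    static boolean utf8LosesHighByte() {
        String decoded = new String(new byte[] {(byte) 0xFF}, StandardCharsets.UTF_8);
        return decoded.charAt(0) == '\uFFFD'; // replacement character, not 0xFF
    }
}
```

A multi-byte encoding such as UTF-8 would not preserve this one-to-one mapping, which is presumably why the stream is opened with an explicit single-byte encoding.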
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
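The sizes reported in Figure 8 are mutually consistent, and the check can be written down as arithmetic. The sketch below assumes the EA IFF 85 layout: each chunk occupies a 4-byte ID, a 4-byte size, and its data padded to an even length, and the FORM size also counts the 4-byte form type. The class FormSizeCheck and its methods are hypothetical; the grammar of Figure 6 performs no such cross-check in the lines shown here.

```java
/** Hypothetical consistency check for the chunk sizes printed in Figure 8. */
class FormSizeCheck {
    /** Bytes a chunk occupies: 4-byte ID + 4-byte size + data, padded to an even length. */
    static long chunkLength(long ckSize) {
        return 8 + ckSize + (ckSize & 1);
    }

    /** Expected FORM ckSize: the 4-byte form type plus all nested chunks. */
    static long formSize(long... ckSizes) {
        long total = 4;
        for (long size : ckSizes) {
            total += chunkLength(size);
        }
        return total;
    }

    public static void main(String[] args) {
        // BMHD, CMAP, CAMG, and BODY sizes as reported in Figure 8
        System.out.println(formSize(20, 48, 4, 2448));
    }
}
```

With the Figure 8 values, 4 + 28 + 56 + 12 + 2456 = 2556, which matches the reported ILBM size. The CMAP size of 48 likewise equals 16 colors of 3 bytes each.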
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;
options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}
// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
    // Default byte order is little-endian
    String byteOrder = "II";
}
// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId=fourCC { $ckId.value.equals("RIFF") }?
      ckSize=dWordValue
      formType=fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;
fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;
formatSubChunk
    : subChunkId=fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize=dWordValue
      commonFields
      pcmSpecificField
    ;
commonFields
    : formatTag=wordValue           // Format category
      channels=wordValue            // Number of channels
      samplesPerSec=dWordValue      // Sampling rate
      avgBytesPerSec=dWordValue     // For buffer estimation
      blockAlign=wordValue          // Data block size
    ;
pcmSpecificField
    : bitsPerSample=wordValue       // Sample size
    ;
dataSubChunk
    : subChunkId=fourCC { $subChunkId.value.equals("data") }?
      ckSize=dWordValue
      waveData[$ckSize.value]
    ;
waveData [long size]
    : ( { size >= 0 }?=> BYTE { size--; } )+
    ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1=BYTE b2=BYTE b3=BYTE b4=BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (long)
                ( getUByteValue($b1)       |
                  getUByteValue($b2) << 8  |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        } else {
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8  |
                  getUByteValue($b4) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the signed word data type
wordValue returns [int value]
    : b1=BYTE b2=BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
Terminal rule used to create the Lexer
BYTE
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
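For comparison with Figure 9, the same chunk sequence can be checked by a small hand-written recognizer. The following is a minimal sketch, not the parser that ANTLR generates: the class WavePcmCheck, its methods, and the synthetic sample file are hypothetical. It assumes little-endian fields throughout (byteOrder "II" in the grammar) and treats the data chunk as valid when its declared size does not exceed the bytes that remain, which is an assumption of this sketch rather than a rule stated in the figure.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical hand-written recognizer following the rule order of Figure 9. */
class WavePcmCheck {
    static String fourCC(ByteBuffer in) {
        byte[] id = new byte[4];
        in.get(id);
        return new String(id, StandardCharsets.ISO_8859_1);
    }

    static long dWord(ByteBuffer in) { return in.getInt() & 0xFFFFFFFFL; }

    static int word(ByteBuffer in) { return in.getShort() & 0xFFFF; }

    static boolean isWavePcm(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        try {
            if (!fourCC(in).equals("RIFF")) return false;   // waveform_pcm
            dWord(in);                                      // ckSize
            if (!fourCC(in).equals("WAVE")) return false;   // formType
            if (!fourCC(in).equals("fmt ")) return false;   // formatSubChunk
            dWord(in);                                      // ckSize
            word(in); word(in); dWord(in); dWord(in); word(in); // commonFields
            word(in);                                       // pcmSpecificField
            if (!fourCC(in).equals("data")) return false;   // dataSubChunk
            long dataSize = dWord(in);
            return dataSize <= in.remaining();              // waveData (assumed check)
        } catch (BufferUnderflowException e) {
            return false; // file ended before the structure was complete
        }
    }

    /** Builds a 48-byte synthetic WAVE file with four data bytes. */
    static byte[] sample() {
        ByteBuffer out = ByteBuffer.allocate(48).order(ByteOrder.LITTLE_ENDIAN);
        out.put("RIFF".getBytes(StandardCharsets.ISO_8859_1)).putInt(40);
        out.put("WAVE".getBytes(StandardCharsets.ISO_8859_1));
        out.put("fmt ".getBytes(StandardCharsets.ISO_8859_1)).putInt(16);
        out.putShort((short) 1).putShort((short) 1).putInt(8000).putInt(8000);
        out.putShort((short) 1).putShort((short) 8);
        out.put("data".getBytes(StandardCharsets.ISO_8859_1)).putInt(4).put(new byte[4]);
        return out.array();
    }
}
```

Each method call corresponds to one nonterminal of the grammar, which illustrates how closely a recursive descent parser follows the productions.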
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
It is, however, the only data description language based on formal grammars that is used for
creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
JHOVE project's work in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars, based on binary file (array) grammars, are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with a pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The parser generator that comes closest to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A. V. Aho. Indexed Grammars: An Extension of Context-free Grammars. Journal
of the ACM, Vol. 15, Issue 4, Oct. 1968.
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology "Appendix H LXO file format" Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A: Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last, with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$
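The sample count above is the kind of arithmetic constraint that an attribute in the grammar could verify against the BODY chunk size. A minimal Java sketch of the computation follows; the class Svx8BodySize and its method are hypothetical names, and the sketch assumes uncompressed 8-bit samples so that one sample occupies one byte.

```java
/** Hypothetical helper computing the expected number of samples in an 8SVX BODY chunk. */
class Svx8BodySize {
    /**
     * Sums the samples of all octaves: octave k (k = 0 is the highest frequency)
     * holds 2^k * (oneShotHiSamples + repeatHiSamples) samples.
     */
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        long highestOctave = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        for (int k = 0; k < ctOctave; k++) {
            total += (1L << k) * highestOctave;
        }
        return total;
    }
}
```

For instance, with ctOctave = 3, oneShotHiSamples = 100, and repeatHiSamples = 28, the octaves hold 128, 256, and 512 samples, for a total of 896. The sum equals (2^ctOctave - 1) times the size of the highest octave.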
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The <length of header data> is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
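The header chunk production can be illustrated with a short recognizer. The sketch below is hypothetical (the class MidiHeader is not from the cited specifications) and assumes that all multi-byte integers are stored most significant byte first, as the Standard MIDI File specification prescribes. It checks the "MThd" tag, checks that the length of header data is 6, and reads the three 16-bit integers.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical recognizer for the <headerchunk> production. */
class MidiHeader {
    final int format;
    final int ntrks;
    final int division;

    MidiHeader(int format, int ntrks, int division) {
        this.format = format;
        this.ntrks = ntrks;
        this.division = division;
    }

    static MidiHeader parse(byte[] data) {
        if (data.length < 14) {
            throw new IllegalArgumentException("header chunk needs 14 bytes");
        }
        ByteBuffer in = ByteBuffer.wrap(data); // ByteBuffer is big-endian by default
        byte[] id = new byte[4];
        in.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.ISO_8859_1))) {
            throw new IllegalArgumentException("missing MThd tag");
        }
        long length = in.getInt() & 0xFFFFFFFFL;
        if (length != 6) {
            throw new IllegalArgumentException("length of header data must be 6");
        }
        int format = in.getShort() & 0xFFFF;
        int ntrks = in.getShort() & 0xFFFF;
        int division = in.getShort() & 0xFFFF;
        return new MidiHeader(format, ntrks, division);
    }
}
```

A header of the bytes "MThd" 00 00 00 06 00 01 00 02 00 60 describes a format 1 file with 2 tracks and a division of 96.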
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote
                   lowVelocity highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
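The offset table above can be checked mechanically. The following Java sketch is illustrative only: the class and method names are hypothetical, the sizes are read as little-endian 32-bit values (as in other RIFF containers), and the "VP8 " tag is assumed to carry a trailing space to fill four bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch: checks the first 20 bytes of a file against the WebP layout table. */
final class WebPHeaderSketch {

    /** Reads the 4-byte tag that starts at the given absolute offset. */
    static String tag(ByteBuffer b, int offset) {
        byte[] t = new byte[4];
        for (int i = 0; i < 4; i++) {
            t[i] = b.get(offset + i);
        }
        return new String(t, StandardCharsets.ISO_8859_1);
    }

    /** Returns true if the tags and sizes are consistent with the table. */
    static boolean looksLikeWebP(byte[] file) {
        if (file.length < 20) {
            return false;
        }
        ByteBuffer b = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        long riffSize = b.getInt(4) & 0xFFFFFFFFL;  // counts bytes from offset 8
        long vp8Size = b.getInt(16) & 0xFFFFFFFFL;  // counts bytes from offset 20
        return tag(b, 0).equals("RIFF")
            && tag(b, 8).equals("WEBP")
            && tag(b, 12).equals("VP8 ")
            && riffSize + 8 <= file.length
            && vp8Size + 20 <= file.length;
    }
}
```

The size comparisons use `<=` rather than `==` so that trailing padding does not cause a rejection; a stricter validator could require equality.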
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last.
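The segment layout above can be walked with a few lines of code. The Java sketch below is an illustration under stated assumptions: the class and method names are hypothetical, and the walk stops at the start-of-scan segment (type $da), after which compressed image data rather than further length-prefixed segments follows.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: lists the segment types between the SOI header and SOS or EOI. */
final class JpegSegmentSketch {

    static List<Integer> segmentTypes(byte[] f) {
        if (f.length < 4 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 1 < f.length) {
            if ((f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                return types;                       // EOI trailer reached
            }
            if (pos + 3 >= f.length) {
                break;                              // size field is truncated
            }
            // High byte first, low byte last (not Intel order).
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("bad segment size at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                return types;                       // SOS: compressed data follows
            }
            pos += 2 + size;                        // skip $ff, type, and 'size' bytes
        }
        throw new IllegalArgumentException("missing EOI trailer");
    }
}
```

Because the size counts its own two bytes but not the $ff and type bytes, the next segment begins at the current offset plus 2 plus the size.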
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000], Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNGfile -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            CompressionMethod FilterMethod InterlaceMethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974],
and attribute grammars [Knuth 1969, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars, termed attribute grammars, that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar GA is a triple <G, A, AR>, where G is a context-free grammar for the
language, A associates each grammar symbol X ∈ (N ∪ Σ) with a set of attributes, and AR
associates each production R ∈ P with a set of attribute computation rules. A(X), where X ∈ (N ∪
Σ), can be further partitioned into two sets: synthesized attributes S(X) and inherited attributes
I(X). AR(R), where R ∈ P, contains rules for computing inherited and synthesized attributes
associated with the symbols in the production R. Synthesized attributes are evaluated and passed
up from deeper terminals and non-terminals, while inherited attributes are passed down from
upper non-terminals.
An L-attributed grammar is an attribute grammar that uses
i) synthesized attributes, and
ii) inherited attributes, where each inherited attribute depends only on
   a) attributes of the other child nodes to its left, and
   b) inherited attributes of the parent node P itself.
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently in top-down parsing.
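The one-pass property can be illustrated with a small Java sketch. In a recursive descent parser, inherited attributes naturally become method parameters and synthesized attributes become return values. The class name, the toy chunk layout (a 4-byte big-endian size followed by 3-byte entries), and the method names are assumptions made for illustration; they are not taken from the grammars in this report.

```java
import java.nio.ByteBuffer;

/** Sketch: L-attributed evaluation during a single top-down, left-to-right pass. */
final class LAttributedSketch {

    /** chunk -> size entries; returns the synthesized number of entries. */
    static int parseChunk(ByteBuffer in) {
        int size = in.getInt();             // synthesized from a terminal (big-endian by default)
        if (size < 0 || size % 3 != 0 || size > in.remaining()) {
            throw new IllegalArgumentException("bad chunk size: " + size);
        }
        return parseEntries(in, size / 3);  // passed down as an inherited attribute
    }

    /** entries -> entry entries | empty; 'remaining' comes from the left context. */
    static int parseEntries(ByteBuffer in, int remaining) {
        int parsed = 0;
        while (remaining > 0) {
            in.get();                       // each entry is three bytes
            in.get();
            in.get();
            remaining--;
            parsed++;
        }
        return parsed;                      // synthesized attribute passed back up
    }
}
```

The inherited value depends only on a symbol to its left (the size field), which is what allows it to be computed before the entries are parsed.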
3.3 Semantic Grammar
An interpreter for a binary file format that uses a grammar to display or play the contents of a
file, or to convert it to another file format, needs to produce a semantic representation that is
appropriate to the task. Natural language understanding components of dialogue systems face a
similar need to produce a semantic representation that is appropriate to the dialogue task. To
address this need, some developers of dialogue systems make use of an approach called semantic
grammar [Issar and Ward 1993; Ward and Issar 1994]. A semantic grammar is a CFG in which
the non-terminals on the left-hand side of rules correspond to semantic entities. A parser for a
semantic grammar produces a labeling of the input string with semantic node labels, for example:
Labels: SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME
Input: Show me flights from Boston to San Francisco on Tuesday morning
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats, but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format specification are represented as nonterminals in the grammar.
Constant values in fields are terminals that appear on the right-hand side of productions.
Nonterminal field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo-EBNF rules and C data structure declarations [Morrison
1986]. In this section the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, ham.pic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a mixture of synthesized semantic rules
(attribute values of the left side of a production are calculated from attribute values of symbols
on the right side) and inherited rules (attribute values of symbols on the right side of a
production are calculated from attribute values of the symbol on the left side). The attribute
grammar is also an L-attributed grammar. Figure 4 shows a parse tree for the file ham.pic.
Figure 2. ILBM File ham.pic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
         ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitmapHeader>
<BitmapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]    n = cksize.value/3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
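The typed primitive nonterminals of the bitmap header map directly onto fixed-width reads. The Java sketch below is illustrative: the class and field names are hypothetical, IFF fields are read in big-endian order, and width and height are read as 16-bit values, which is consistent with the 20-byte BMHD chunk size reported in Figure 8 and with the uWordValue rules of Figure 6.

```java
import java.nio.ByteBuffer;

/** Sketch: reads the 20-byte bitmap header that follows "BMHD" and cksize. */
final class BmhdSketch {
    int width, height, xPosition, yPosition;
    int nPlanes, masking, compression, transparentColor;
    int xAspect, yAspect, pageWidth, pageHeight;

    static BmhdSketch parse(ByteBuffer in) {
        if (in.remaining() < 20) {
            throw new IllegalArgumentException("bitmap header needs 20 bytes");
        }
        BmhdSketch h = new BmhdSketch();
        h.width = in.getShort() & 0xFFFF;             // UINT16
        h.height = in.getShort() & 0xFFFF;            // UINT16
        h.xPosition = in.getShort();                  // INT16
        h.yPosition = in.getShort();                  // INT16
        h.nPlanes = in.get() & 0xFF;                  // BYTE
        h.masking = in.get() & 0xFF;                  // BYTE
        h.compression = in.get() & 0xFF;              // BYTE
        in.get();                                     // reserved BYTE, ignored
        h.transparentColor = in.getShort() & 0xFFFF;  // UINT16
        h.xAspect = in.get() & 0xFF;                  // BYTE
        h.yAspect = in.get() & 0xFF;                  // BYTE
        h.pageWidth = in.getShort();                  // INT16
        h.pageHeight = in.getShort();                 // INT16
        return h;
    }
}
```

A validator could additionally compare the enclosing cksize with the 20 bytes consumed here, which is the kind of context-sensitive check the attribute rules are meant to express.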
Figure 4. A Parse Tree for the ILBM File ham.pic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' <data-ck> | <silence-ck>
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
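Once data types are attached to the primitive non-terminals, the format chunk can be read and checked field by field. The Java sketch below is illustrative only: the class and method names are hypothetical, RIFF fields are read in little-endian order, and the two consistency relations (block alignment and average bytes per second) are assumptions taken from the usual description of PCM data in the RIFF specification rather than from Figure 5.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch: checks the 16-byte PCM body of a 'fmt ' chunk (after tag and cksize). */
final class WaveFmtSketch {

    static boolean isConsistentPcm(byte[] body) {
        if (body.length < 16) {
            return false;
        }
        ByteBuffer b = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        int formatTag = b.getShort() & 0xFFFF;        // <wFormatTag:WORD>
        int channels = b.getShort() & 0xFFFF;         // <wChannels:WORD>
        long samplesPerSec = b.getInt() & 0xFFFFFFFFL;   // <dwSamplesPerSec:DWORD>
        long avgBytesPerSec = b.getInt() & 0xFFFFFFFFL;  // <dwAvgBytesPerSec:DWORD>
        int blockAlign = b.getShort() & 0xFFFF;       // <wBlockAlign:WORD>
        int bitsPerSample = b.getShort() & 0xFFFF;    // <wBitsPerSample:WORD>
        int bytesPerSample = (bitsPerSample + 7) / 8; // round up to whole bytes
        return formatTag == 0x0001                    // value required by the grammar
            && blockAlign == channels * bytesPerSample
            && avgBytesPerSec == samplesPerSec * blockAlign;
    }
}
```

For example, 16-bit stereo audio at 44100 samples per second has a block alignment of 4 bytes and an average rate of 176400 bytes per second.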
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
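The handling of file pointers described above can be sketched in Java as follows. The toy format (a 4-byte big-endian pointer, then one trailer byte, with a 16-bit record stored at the pointed-to address) and all class and method names are hypothetical; the sketch only illustrates pushing the current address, parsing at the target, and popping the address to resume.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: a recursive descent parser that follows a file pointer using a stack. */
final class PointerStackSketch {
    private final ByteBuffer in;
    private final Deque<Integer> returnAddresses = new ArrayDeque<>();

    PointerStackSketch(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** file -> pointer trailer; returns {value of pointed-to record, trailer byte}. */
    int[] parseFile() {
        int pointer = in.getInt();
        int value = followPointer(pointer);
        int trailer = in.get() & 0xFF;        // parsing resumes after the pointer field
        return new int[] {value, trailer};
    }

    private int followPointer(int target) {
        if (target < 0 || target > in.limit() - 2) {
            throw new IllegalArgumentException("pointer outside file: " + target);
        }
        returnAddresses.push(in.position());  // remember where to resume
        in.position(target);
        int value = parseRecord();
        in.position(returnAddresses.pop());   // return to the saved address
        return value;
    }

    /** record -> UINT16 */
    private int parseRecord() {
        return in.getShort() & 0xFFFF;
    }
}
```

An explicit stack, rather than a single saved address, allows the pointed-to structure to contain further pointers of its own.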
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
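The conversion from character tokens to binary data types can be sketched in plain Java, independent of ANTLR. The class and method names below are hypothetical. The sketch assumes the file is decoded as ISO-8859-1, so that each byte maps to exactly one character with the same numeric value.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: converts one-byte character tokens to integer data types. */
final class ByteTokenSketch {

    /** Decodes raw bytes so that each byte becomes exactly one char token. */
    static char[] tokens(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1).toCharArray();
    }

    /** Value of a single byte token in the range 0..255. */
    static int ubyte(char token) {
        return token & 0xFF;
    }

    /** Signed 16-bit value, high byte first. */
    static int int16BigEndian(char hi, char lo) {
        return (short) ((ubyte(hi) << 8) | ubyte(lo));
    }

    /** Unsigned 32-bit value, high byte first, held in a long. */
    static long uint32BigEndian(char b1, char b2, char b3, char b4) {
        // Widening the top byte to long before shifting keeps the result non-negative.
        return ((long) ubyte(b1) << 24) | (ubyte(b2) << 16) | (ubyte(b3) << 8) | ubyte(b4);
    }
}
```

Widening before the shift matters for sizes of 2^31 or more: shifting an int left by 24 bits and casting afterwards would sign-extend the value.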
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;
options {
    language = Java;   // The programming language of the output parser
    TokenLabelType = CommonToken;
}
// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}
// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      // Print out ILBM chunk size
      { System.out.println("ID ILBM Size " + $ckSize.value); }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks
      body
      { $result = true; }
    ;
// The bmhd, cmap, and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg )+
    ;
bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;
bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      // Print out values in the bitmap structure
      {
        System.out.println(" Width " + $width.value);
        System.out.println(" Height " + $height.value);
        System.out.println(" X Position " + $xPosition.value);
        System.out.println(" Y Position " + $yPosition.value);
        System.out.println(" Number of Panes " + $nPanes.value);
        System.out.println(" Masking " + $masking.value);
        System.out.println(" Compression " + $compression.value);
        System.out.println(" Reserved1 " + $reserved1.value);
        System.out.println(" TransParent Color " +
            $transparentColor.value);
        System.out.println(" X Aspest " + $xAspect.value);
        System.out.println(" Y Aspect " + $yAspect.value);
        System.out.println(" Page Width " + $pageWidth.value);
        System.out.println(" Page Height " + $pageHeight.value);
      }
    ;
cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
    : CMAP ckSize
      // Print out chunk size
      {
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        // Print out color values
        {
          System.out.println(" Color[" + i++ + "] " +
              $colorRegister.redValue + " " +
              $colorRegister.greenValue + " " +
              $colorRegister.blueValue);
        }
      )+
    ;
colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;
camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println(" View Modes 0x" +
            Long.toHexString($longValue.value));
      }
    ;
// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;
crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;
cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;
ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;
cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;
body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;
// Defines the data portion of the body chunk
data returns [long value]
    : ( b += BYTE )+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;
// The ckSize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;
// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        // Widen to long before shifting so sizes of 2^31 or more stay positive
        $value = ((long) getUByteValue($h1) << 24) |
                 (getUByteValue($h2) << 16) |
                 (getUByteValue($l1) << 8) |
                 getUByteValue($l2);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      // Calls getUByteValue on the BYTE token
      { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6. An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the grammar described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;
import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;
public class Test {
    static String filePath = "C:/Users/sl170/Pictures/IFF-ilbm/BLK.IFF";
    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps every input byte to exactly one character token
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;
options {
    language = Java;   // The programming language of the output parser
    TokenLabelType = CommonToken;
}
// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
    // Default byte order is little endian
    String byteOrder = "II";
}
// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;
fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;
formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;
commonFields
    : formatTag = wordValue          // Format category
      channels = wordValue           // Number of channels
      samplesPerSec = dWordValue     // Sampling rate
      avgBytesPerSec = dWordValue    // For buffer estimation
      blockAlign = wordValue         // Data block size
    ;
pcmSpecificField
    : bitsPerSample = wordValue      // Sample size
    ;
dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;
// Consumes exactly 'size' bytes: the predicate stops the loop when none remain
waveData [long size]
    : ( { $size > 0 }?=> BYTE { $size--; } )+
    ;
// Defines the long (32-bit) data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        // The most significant byte is widened to long before shifting
        if (byteOrder.equals("II"))
            $value = getUByteValue($b1) |
                     (getUByteValue($b2) << 8) |
                     (getUByteValue($b3) << 16) |
                     ((long) getUByteValue($b4) << 24);
        else
            $value = ((long) getUByteValue($b1) << 24) |
                     (getUByteValue($b2) << 16) |
                     (getUByteValue($b3) << 8) |
                     getUByteValue($b4);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Defines the word (16-bit) data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        else
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rule used to create the lexer
BYTE : . ;
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with a pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
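The idea of a signature test for chunk-based formats can be illustrated with a minimal Python sketch. It is not the magic file of [Underwood 2009]; the table SIGNATURES and the function identify are hypothetical names introduced here, and the sketch assumes only that the container tag sits at offset 0 and the form type at offset 8, as in IFF and RIFF files.

```python
# Hypothetical signature table: (container tag at offset 0, form type at offset 8).
# The entries are a small illustrative subset, not the authors' magic file.
SIGNATURES = {
    (b"FORM", b"ILBM"): "IFF Interleaved Bitmap",
    (b"FORM", b"8SVX"): "IFF 8-bit Sampled Voice",
    (b"FORM", b"AIFF"): "Audio Interchange File Format",
    (b"RIFF", b"WAVE"): "RIFF Waveform",
    (b"RIFF", b"WEBP"): "WebP",
}


def identify(header: bytes) -> str:
    """Return a format name for the first 12 bytes of a file, or 'unknown'."""
    if len(header) < 12:
        return "unknown"
    # Bytes 4..7 hold the chunk size and are ignored by the signature test.
    key = (header[0:4], header[8:12])
    return SIGNATURES.get(key, "unknown")
```

A signature test of this kind only selects which grammar to apply; it says nothing about whether the rest of the file conforms to the format.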
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The parser generator that comes closest to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari, Caligari trueSpace file format, Document Version 6, July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYX's note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOffice.org's Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
 samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$
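The sample count can be checked with a short Python sketch. The function name body_sample_count is an assumption made for illustration; the arithmetic follows the description above, in which octave k (counting from the highest frequency, k = 0) holds 2^k times the samples of the highest octave.

```python
def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """Total samples in an 8SVX BODY chunk holding ct_octave octaves."""
    per_high_octave = one_shot_hi + repeat_hi
    # Octave k holds 2**k times as many samples as the highest octave.
    return sum(2 ** k for k in range(ct_octave)) * per_high_octave
```

For uncompressed data with one byte per sample, a validator could compare this value with the cksize of the BODY chunk; that comparison is a suggestion here, not part of the grammar above.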
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard Midi File Format
Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks> | <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
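As an illustration of the header chunk rule, the following Python sketch reads the tag, the length of header data, and the three 16-bit integers. The function name parse_midi_header is hypothetical; the sketch assumes the big-endian byte order used by Standard MIDI Files and checks only the header chunk, not the track chunks.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Parse the header chunk: 'MThd', a 32-bit length of 6, three 16-bit integers."""
    if len(data) < 14:
        raise ValueError("too short for a MIDI header chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd" or length != 6:
        raise ValueError("not a Standard MIDI File header chunk")
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}
```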
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
 sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
 detune
 lowNote
 highNote
 lowVelocity
 highVelocity
 gain
 sustainLoop
 releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
 marker
 count
 text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks
is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
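The offset table above can be turned into a small recognizer. The following Python sketch is illustrative only: the function name parse_webp_header is hypothetical, and the little-endian byte order of the two size fields is an assumption taken from the RIFF convention rather than from the table itself.

```python
import struct


def parse_webp_header(data: bytes) -> dict:
    """Check the 20-byte WebP header laid out in the offset table above."""
    if len(data) < 20:
        raise ValueError("too short for a WebP header")
    # Offsets 0, 4, 8, 12, 16: tag, size, form type, codec tag, codec data size.
    riff, riff_size, form, codec, vp8_size = struct.unpack("<4sI4s4sI", data[:20])
    if riff != b"RIFF" or form != b"WEBP" or codec != b"VP8 ":
        raise ValueError("not a WebP file in the layout described above")
    return {
        "riff_size": riff_size,
        "vp8_size": vp8_size,
        "vp8_data": data[20:20 + vp8_size],
    }
```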
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker, see below
- any number of "segments" (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes, but not
          including the $ff and the type byte. Note, not Intel order:
          high byte first, low byte last
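The segment format above can be walked with a few lines of Python. This is a sketch under stated assumptions: the function name list_segments is hypothetical, fill bytes between segments are not handled, and the walk stops at the start-of-scan segment ($da) because compressed image data follows it and is not covered by the segment header described above.

```python
import struct


def list_segments(data: bytes) -> list:
    """Return (type, size) pairs for the segments that follow the SOI marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type in (0xD9, 0xDA):  # EOI, or SOS followed by compressed data
            break
        # Size is big-endian and counts its own two bytes but not $ff and the type.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((seg_type, size))
        pos += 2 + size
    return segments
```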
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
 Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entries -> entry
entry -> red green blue
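A recognizer for the header chunk can be sketched in Python as follows. The function name parse_ihdr is hypothetical. The sketch relies on two facts from the PNG specification: each chunk stores its 4-byte length before its 4-byte type, and each chunk ends with a CRC-32 computed over the type and data fields.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def parse_ihdr(data: bytes) -> dict:
    """Validate the PNG signature and the IHDR chunk that must follow it."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    if len(data) < 33:
        raise ValueError("too short to hold an IHDR chunk")
    length, ctype = struct.unpack(">I4s", data[8:16])
    if ctype != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    if zlib.crc32(ctype + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    names = ("width", "height", "bit_depth", "colour_type",
             "compression_method", "filter_method", "interlace_method")
    return dict(zip(names, struct.unpack(">IIBBBBB", body)))
```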
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
A semantic grammar is just what we need to fill in the values of a programming language data
structure. Semantic grammars are not applicable to generic systems but are limited to specific
domains. This is not a problem in specifying or parsing a binary file format, since a grammar for
a binary file format is not meant to apply to all binary file formats but just to a single format or
to a family of file formats.
4 Grammars for Chunk-based Binary File Formats
The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
The field names of the file format spec are represented as nonterminals in the grammar. Constant
values in fields are terminals that appear on the right-hand side of productions. Nonterminal
field names have data types associated with them.
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo EBNF rules and C data structure declarations [Morrison
1986a]. In this section the "grammar" of the ILBM file format specification is converted to a
context-free binary grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file, hampic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a mixture of synthesized semantic rules
(attribute values of the left side of a production are calculated from attribute values of symbols
on the right side) and inherited rules (attribute values of symbols on the right side of a
production are calculated from attribute values of the symbol on the left side). The attribute
grammar is also an L-attributed grammar. Figure 4 shows a parse tree for the file hampic.
Figure 2 ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
 ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
 <height type=UINT16>
 <xposition type=INT16>
 <yposition type=INT16>
 <nplanes type=BYTE>
 <masking type=BYTE>
 <compression type=BYTE>
 <reserved type=BYTE>
 <transparentcolor type=UINT16>
 <xaspect type=BYTE>
 <yaspect type=BYTE>
 <pagewidth type=INT16>
 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n] n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
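The CMAP production shows how an attribute drives the parse: the value of cksize determines n, the number of color triples to read. The following Python sketch illustrates that rule. The names read_chunk and parse_cmap are hypothetical, and the sketch assumes the big-endian byte order and the even-length chunk padding of the EA IFF 85 standard.

```python
import struct


def read_chunk(data: bytes, pos: int):
    """Read one IFF chunk at pos; return (ckid, cksize, body, next_pos)."""
    ckid, cksize = struct.unpack(">4sI", data[pos:pos + 8])
    body = data[pos + 8:pos + 8 + cksize]
    if len(body) != cksize:
        raise ValueError("chunk extends past end of data")
    # IFF pads odd-sized chunks with one byte so the next chunk starts on an even offset.
    next_pos = pos + 8 + cksize + (cksize & 1)
    return ckid, cksize, body, next_pos


def parse_cmap(data: bytes, pos: int = 0):
    """Parse a CMAP chunk; the number of colors is computed from cksize."""
    ckid, cksize, body, next_pos = read_chunk(data, pos)
    if ckid != b"CMAP":
        raise ValueError("expected CMAP chunk")
    n = cksize // 3  # attribute rule: n = cksize / 3
    colors = [tuple(body[3 * i:3 * i + 3]) for i in range(n)]
    return colors, next_pos
```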
Figure 4 A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
 <fmt-ck>
 [<fact-ck>]
 [<cue-ck>]
 [<playlist-ck>]
 [<assoc-data-list>]
 <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
 <wChannels:WORD>
 <dwSamplesPerSec:DWORD>
 <dwAvgBytesPerSec:DWORD>
 <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' <data-ck> | <silence-ck>
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
 <dwPosition:DWORD>
 <fccChunk:FOURCC>
 <dwChunkStart:DWORD>
 <dwBlockStart:DWORD>
 <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
 <dwLength:DWORD>
 <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
 <dwSampleLength:DWORD>
 <dwPurpose:DWORD>
 <wCountry:WORD>
 <wLanguage:WORD>
 <wDialect:WORD>
 <wCodePage:WORD>
 <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
 <fileData:BYTE>+
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
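The primitive types named in Figure 5 (WORD, DWORD, FOURCC) map onto fixed-width reads. The following Java sketch shows one way to read them, assuming the little-endian byte order that RIFF uses; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helpers for the primitive data types used in Figure 5. */
final class RiffPrimitives {

    /** Wraps raw file bytes in a buffer that decodes integers in little-endian order. */
    static ByteBuffer wrap(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** WORD: 16-bit unsigned integer. */
    static int readWord(ByteBuffer in) {
        return in.getShort() & 0xFFFF;
    }

    /** DWORD: 32-bit unsigned integer, widened to long so the high bit is not a sign. */
    static long readDword(ByteBuffer in) {
        return in.getInt() & 0xFFFFFFFFL;
    }

    /** FOURCC: four bytes interpreted as a four-character code such as "fmt ". */
    static String readFourCC(ByteBuffer in) {
        byte[] code = new byte[4];
        in.get(code);
        return new String(code, StandardCharsets.ISO_8859_1);
    }
}
```

For example, the bytes 44 AC 00 00 read as a DWORD give 0x0000AC44, the sampling rate 44100.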
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
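As an illustration of a lexer-free recursive descent recognizer, the sketch below implements the <BMHD> and <BitMapHeader> productions of Figure 3 as one Java method each. It is a hand-written sketch rather than generated code, the class name is hypothetical, and the check that cksize equals 20 is an assumption derived from summing the field widths of the bitmap header.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical hand-written recognizer: one method per production rule. */
final class BmhdRecognizer {
    private final ByteBuffer in; // big-endian by default, matching IFF
    int width, height, nPlanes;

    BmhdRecognizer(byte[] data) {
        this.in = ByteBuffer.wrap(data);
    }

    /** <BMHD> -> "BMHD" <cksize type=UINT32> <BitMapHeader> */
    void bmhd() {
        expectId("BMHD");
        long cksize = in.getInt() & 0xFFFFFFFFL;
        if (cksize != 20) { // assumption: 20 is the sum of the header field widths
            throw new IllegalArgumentException("unexpected BMHD size " + cksize);
        }
        bitMapHeader();
    }

    /** <BitMapHeader>: the rule itself predicts the type of every field, so no lexer is needed. */
    void bitMapHeader() {
        width = uint16();
        height = uint16();
        int16(); int16();          // xposition, yposition
        nPlanes = uint8();
        uint8(); uint8(); uint8(); // masking, compression, reserved
        uint16();                  // transparentcolor
        uint8(); uint8();          // xaspect, yaspect
        int16(); int16();          // pagewidth, pageheight
    }

    private int uint8()  { return in.get() & 0xFF; }
    private int uint16() { return in.getShort() & 0xFFFF; }
    private int int16()  { return in.getShort(); }

    private void expectId(String id) {
        byte[] b = new byte[4];
        in.get(b);
        if (!id.equals(new String(b, StandardCharsets.ISO_8859_1))) {
            throw new IllegalArgumentException("expected chunk ID " + id);
        }
    }
}
```

A file that ends early surfaces as a java.nio.BufferUnderflowException, which a recognizer can report as a syntax error.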
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
use these values in parsing the chunk data or images. The pointers to file addresses that occur in
the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped, and the procedure
corresponding to the nonterminal at that position is executed.
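One possible reading of the pushdown-stack mechanism is sketched below in Java. The two-rule file layout (a 32-bit pointer followed by a trailer byte, with a 16-bit record at the pointed-to address) is a toy layout invented for this illustration, and all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of pointer handling with a pushdown stack of file addresses. */
final class PointerFollowingParser {
    private final ByteBuffer in;
    private final Deque<Integer> returnAddresses = new ArrayDeque<>();

    PointerFollowingParser(byte[] data) {
        this.in = ByteBuffer.wrap(data);
    }

    /** <file> -> <pointer type=UINT32> <trailer type=BYTE>  (toy layout). */
    int[] file() {
        int offset = in.getInt();
        int value = followPointer(offset);
        int trailer = in.get() & 0xFF; // parsing resumes directly after the pointer
        return new int[] { value, trailer };
    }

    /** Pushes the current address, parses at the pointed-to location, then pops and resumes. */
    private int followPointer(int offset) {
        if (offset < 0 || offset > in.limit() - 2) {
            throw new IllegalArgumentException("pointer out of range: " + offset);
        }
        returnAddresses.push(in.position());
        in.position(offset);
        int value = record();
        in.position(returnAddresses.pop());
        return value;
    }

    /** <record> -> <value type=UINT16> */
    private int record() {
        return in.getShort() & 0xFFFF;
    }
}
```

Nested pointers work the same way: each followPointer call pushes one address, so the stack depth equals the pointer nesting depth.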
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
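The conversion from character tokens to binary data types can be sketched in plain Java as follows. The sketch assumes the file is decoded as ISO-8859-1, so that each character carries exactly one byte value in the range 0 to 255, and it uses big-endian order as in IFF. The class and method names are hypothetical and only mirror the idea behind the conversion functions.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helpers that turn one-byte character tokens into binary data types. */
final class CharTokenTypes {

    /** Decodes raw bytes so that character i has the numeric value of byte i. */
    static String decode(byte[] fileBytes) {
        return new String(fileBytes, StandardCharsets.ISO_8859_1);
    }

    /** One character token as an unsigned byte (0..255). */
    static int uByte(char c) {
        return c & 0xFF;
    }

    /** Two character tokens as an unsigned 16-bit integer, high byte first. */
    static int uInt16(char hi, char lo) {
        return (uByte(hi) << 8) | uByte(lo);
    }

    /** Two character tokens as a signed 16-bit integer; the cast performs sign extension. */
    static int int16(char hi, char lo) {
        return (short) uInt16(hi, lo);
    }

    /** Four character tokens as an unsigned 32-bit integer, widened to long before shifting. */
    static long uInt32(char b1, char b2, char b3, char b4) {
        return ((long) uByte(b1) << 24) | (uByte(b2) << 16) | (uByte(b3) << 8) | uByte(b4);
    }
}
```

Any other decoding than a one-byte-per-character charset would merge or alter bytes, which is why the character set has to be fixed when the file is opened.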
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
    language = Java;              // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}
// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      {
          // Print out ILBM chunk size
          System.out.println("ID ILBM Size " + $ckSize.value);
      }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks
      body
      { $result = true; }
    ;

// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg
      )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      {
          // Print out values in the bitmap structure
          System.out.println(" Width " + $width.value);
          System.out.println(" Height " + $height.value);
          System.out.println(" X Position " + $xPosition.value);
          System.out.println(" Y Position " + $yPosition.value);
          System.out.println(" Number of Panes " + $nPanes.value);
          System.out.println(" Masking " + $masking.value);
          System.out.println(" Compression " + $compression.value);
          System.out.println(" Reserved1 " + $reserved1.value);
          System.out.println(" TransParent Color " +
              $transparentColor.value);
          System.out.println(" X Aspest " + $xAspect.value);
          System.out.println(" Y Aspect " + $yAspect.value);
          System.out.println(" Page Width " + $pageWidth.value);
          System.out.println(" Page Height " + $pageHeight.value);
      }
    ;
cmap returns [long value]
// Initialize the index value i in the cmap rule
@init { int i = 0; }
    : CMAP ckSize
      {
          // Print out chunk size
          $value = $ckSize.value;
          System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        {
            // Print out color values
            System.out.println(" Color[" + i++ + "] " +
                $colorRegister.redValue +
                " " + $colorRegister.greenValue +
                " " + $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
          $redValue = $red.value;
          $greenValue = $green.value;
          $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
          System.out.println("ID CAMG Size " + $ckSize.value);
          System.out.println(" View Modes 0x" +
              Long.toHexString($longValue.value));
      }
    ;
// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// Defines the data of the body chunk
data returns [long value]
    : (b+=BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;
// The ckSize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
          // The high byte is widened to long before shifting so that values
          // with the top bit set do not become negative
          $value =
              ( (long) getUByteValue($h1) << 24 |
                getUByteValue($h2) << 16 |
                getUByteValue($l1) << 8 |
                getUByteValue($l2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
          $value = (int)
              ( getUByteValue($b1) << 8 |
                getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
          $value = (short)
              ( getUByteValue($b1) << 8 |
                getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      {
          // Calls getUByteValue on the BYTE token
          $value = (int) getUByteValue($BYTE);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
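// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;    // any single character, i.e. one input byte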
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps every byte of the file to exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
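The sizes reported in Figure 8 can be cross-checked arithmetically. Each chunk occupies an 8-byte header (ID and size) plus its data, and the FORM size additionally covers the 4-byte form type, so 4 + (8 + 20) + (8 + 48) + (8 + 4) + (8 + 2448) = 2556, which is the ILBM size printed. Likewise, the CMAP size of 48 bytes divided by 3 bytes per color gives the 16 colors listed. The Java sketch below performs the same check. The class is hypothetical, and the padding of odd-sized chunks to an even length is taken from the IFF convention rather than from the grammar in Figure 6.

```java
/** Hypothetical consistency check between a FORM size and the sizes of its chunks. */
final class FormSizeCheck {

    /** Bytes a chunk occupies: 4-byte ID, 4-byte size, data, and a pad byte if the size is odd. */
    static long chunkLength(long cksize) {
        return 8 + cksize + (cksize & 1);
    }

    /** Expected FORM size: the 4-byte form type plus the lengths of all contained chunks. */
    static long expectedFormSize(long... chunkSizes) {
        long total = 4;
        for (long size : chunkSizes) {
            total += chunkLength(size);
        }
        return total;
    }
}
```

A check of this kind could be added to the start rule as a further semantic predicate.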
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;              // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default byte order is little endian
    String byteOrder = "II";
}
// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue          // Format category
      channels = wordValue           // Number of channels
      samplesPerSec = dWordValue     // Sampling rate
      avgBytesPerSec = dWordValue    // For buffer estimation
      blockAlign = wordValue         // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue      // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

// The gated predicate stops the loop once size bytes have been consumed.
// The original test was size >= 0, which admits one byte too many.
waveData [long size]
    : ( { $size > 0 }?=> BYTE { $size--; } )+
    ;
// Defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
          // The most significant byte is widened to long before shifting so
          // that values with the top bit set do not become negative
          if (byteOrder.equals("II")) {
              $value =
                  ( getUByteValue($b1) |
                    getUByteValue($b2) << 8 |
                    getUByteValue($b3) << 16 |
                    (long) getUByteValue($b4) << 24 );
          } else {
              $value =
                  ( (long) getUByteValue($b1) << 24 |
                    getUByteValue($b2) << 16 |
                    getUByteValue($b3) << 8 |
                    getUByteValue($b4) );
          }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the word (16-bit) data type; the value is not sign-extended
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
          if (byteOrder.equals("II")) {
              $value = (int)
                  ( getUByteValue($b1) |
                    getUByteValue($b2) << 8 );
          } else {
              $value = (int)
                  ( getUByteValue($b1) << 8 |
                    getUByteValue($b2) );
          }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer
BYTE : . ;    // any single character, i.e. one input byte
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
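The grammar in Figure 9 reads the format fields but attaches no semantic predicates to them. As an illustration of the kind of context-sensitive check that could be added, the Java sketch below tests the relationships that hold between the fields of a PCM format chunk: the format tag is 0x0001 (as in Figure 5), the block alignment equals the number of channels times the bytes per sample, and the average byte rate equals the sampling rate times the block alignment. The class and method names are hypothetical, and these checks are not part of the paper's grammar.

```java
/** Hypothetical semantic check over the fields of a PCM 'fmt ' chunk. */
final class PcmFormatCheck {

    static boolean isConsistent(int formatTag, int channels, long samplesPerSec,
                                long avgBytesPerSec, int blockAlign, int bitsPerSample) {
        // Samples are stored in whole bytes, so round the bit depth up
        int bytesPerSample = (bitsPerSample + 7) / 8;
        return formatTag == 0x0001
                && channels > 0
                && blockAlign == channels * bytesPerSample
                && avgBytesPerSec == samplesPerSec * blockAlign;
    }
}
```

For 16-bit stereo audio sampled at 44100 Hz, the block alignment is 4 and the average byte rate is 176400.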
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects, namely
validation of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, as was the work of
the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with a pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
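For the two formats treated in this paper, a signature test amounts to inspecting the container ID in bytes 0 to 3 and the form type in bytes 8 to 11. The Java sketch below illustrates the idea. It is not the magic file of [Underwood 2009]; the class name and the returned labels are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical signature test for the two chunk-based formats discussed in this paper. */
final class ChunkSignature {

    /** Identifies a file from its first 12 bytes: container ID, chunk size, form type. */
    static String identify(byte[] header) {
        if (header.length < 12) {
            return "unknown";
        }
        String container = fourChars(header, 0);
        String formType = fourChars(header, 8); // bytes 4..7 hold the chunk size
        if (container.equals("FORM") && formType.equals("ILBM")) {
            return "IFF ILBM";
        }
        if (container.equals("RIFF") && formType.equals("WAVE")) {
            return "RIFF WAVE";
        }
        return "unknown";
    }

    private static String fourChars(byte[] b, int offset) {
        return new String(b, offset, 4, StandardCharsets.ISO_8859_1);
    }
}
```

Identification of this kind only selects which recognizer to run; whether the file actually conforms to the format is decided by the parser.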
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewer/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G. Back (2002). A specification and scripting language for binary data. Generative
Programming and Component Engineering, Vol. 2487, Lecture Notes in Computer Science, 66-77.
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library, JHOVE2 User's Guide, 2011.
https://bytebucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide_20110222.pdf
[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010). The Data Description
Language EAST Specification. http://public.ccsds.org/publications/archive/644x0b3.pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher, K. & Gruber, R. (2005). PADS: A domain specific language for
processing ad hoc data. ACM Conference on Programming Language Design and
Implementation, ACM Press. http://www.padsproj.org/papers/pldi.pdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft, Multimedia Programming Interface and Data
Specifications 1.0, August 1991. www.kk.iij4u.or.jp/~kondo/wave/mpidata.txt
www.tactilemedia.com/info/MCI_Control_Info.html
http://www-mmsp.ece.mcgill.ca/documents/audioformats/wave/Docs/riffmci.pdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding
system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D. E. Knuth, Semantics of Context-Free Grammars, Mathematical Systems Theory,
vol. 2, pp. 127-145, 1968.
[Knuth 1971] D. E. Knuth, Semantics of Context-Free Grammars (corrections), Mathematical
Systems Theory, vol. 5, pp. 95-96, 1971.
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \cdot (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\text{ctOctave}-1}) \cdot (\text{oneShotHiSamples} + \text{repeatHiSamples})$
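The sample count for the BODY chunk can be computed directly from three VHDR fields. The following Java fragment is a minimal sketch of that calculation; the class and method names are illustrative and do not come from the 8SVX specification.

```java
// Illustrative sketch: expected number of samples in an 8SVX BODY chunk.
// Class and method names are hypothetical.
final class Svx8BodySize {
    /**
     * Sums the samples of all octaves: octave k (k = 0 is the highest
     * frequency octave) holds 2^k * (oneShotHiSamples + repeatHiSamples) samples.
     */
    static long bodySamples(long oneShotHiSamples, long repeatHiSamples, int ctOctave) {
        long highestOctave = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        for (int k = 0; k < ctOctave; k++) {
            total += (1L << k) * highestOctave;
        }
        return total;
    }
}
```

A validator could compare this value with the cksize of the BODY chunk when sCompression indicates uncompressed data, since each sample then occupies one byte.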
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The <length of header data> is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
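The header chunk rule can be checked with a few lines of code. The Java sketch below recognizes the header chunk and enforces the constraint that the length of the header data is 6. It assumes the big-endian byte order used by Standard MIDI Files; the class and field names are illustrative.

```java
// Illustrative sketch: recognizer for <headerchunk> of a Standard MIDI File.
// Class, method and field names are hypothetical.
final class MidiHeader {
    final int format;
    final int ntrks;
    final int division;

    private MidiHeader(int format, int ntrks, int division) {
        this.format = format;
        this.ntrks = ntrks;
        this.division = division;
    }

    /** Reads a big-endian unsigned 16-bit integer at offset off. */
    private static int u16(byte[] b, int off) {
        return ((b[off] & 0xFF) << 8) | (b[off + 1] & 0xFF);
    }

    /** <headerchunk> -> "MThd" <length of header data> <format> <ntrks> <division> */
    static MidiHeader parse(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("file too short for a header chunk");
        }
        String id = new String(file, 0, 4, java.nio.charset.StandardCharsets.US_ASCII);
        if (!id.equals("MThd")) {
            throw new IllegalArgumentException("expected MThd, found " + id);
        }
        long length = ((long) u16(file, 4) << 16) | u16(file, 6);
        if (length != 6) {
            throw new IllegalArgumentException("header data length must be 6, found " + length);
        }
        return new MidiHeader(u16(file, 8), u16(file, 10), u16(file, 12));
    }
}
```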
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these
chunks is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as
a container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP
typically achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of
image quality. A WebP file consists of VP8 image data and a container based on RIFF. The
format of a WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
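The offset table translates into a small header check. The Java sketch below tests the three tags and reads the two size fields. It assumes that the sizes are stored in little-endian order, as in other RIFF files, and that the VP8 tag is padded with a trailing space; the class and method names are illustrative.

```java
// Illustrative sketch: header check for the WebP layout in the table above.
// Class and method names are hypothetical.
final class WebPHeader {
    /** Reads a little-endian unsigned 32-bit integer at offset off. */
    private static long u32le(byte[] b, int off) {
        return (b[off] & 0xFFL)
             | ((b[off + 1] & 0xFFL) << 8)
             | ((b[off + 2] & 0xFFL) << 16)
             | ((b[off + 3] & 0xFFL) << 24);
    }

    /** True if the four bytes at offset off spell the given tag. */
    private static boolean tagAt(byte[] b, int off, String tag) {
        for (int i = 0; i < 4; i++) {
            if (b[off + i] != (byte) tag.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Returns {riffSize, vp8Size}, or null if the file does not match the layout. */
    static long[] readSizes(byte[] file) {
        if (file.length < 20) {
            return null;
        }
        if (!tagAt(file, 0, "RIFF") || !tagAt(file, 8, "WEBP") || !tagAt(file, 12, "VP8 ")) {
            return null;
        }
        long riffSize = u32le(file, 4);   // counts the bytes starting at offset 8
        long vp8Size = u32le(file, 16);   // counts the bytes starting at offset 20
        // Context-sensitive checks: both chunks must fit inside their container.
        if (8 + riffSize > file.length || 20 + vp8Size > 8 + riffSize) {
            return null;
        }
        return new long[] { riffSize, vp8Size };
    }
}
```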
A5 JPEG
A51 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
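The segment layout above is enough to walk the segments of a file. The Java sketch below lists the segment type bytes between the SOI header and the EOI trailer. It stops after a start-of-scan segment (type $da), on the assumption that the entropy-coded data following that segment is not organized as segments; the class and method names are illustrative.

```java
// Illustrative sketch: walk the segments of a JPEG/JFIF file.
// Class and method names are hypothetical.
final class JpegSegments {
    /** Returns the type byte of each segment found after the SOI header. */
    static java.util.List<Integer> segmentTypes(byte[] f) {
        if (f.length < 4 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        java.util.List<Integer> types = new java.util.ArrayList<>();
        int pos = 2;
        while (true) {
            if (pos + 2 > f.length || (f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected a segment at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                return types;                       // EOI trailer reached
            }
            if (pos + 4 > f.length) {
                throw new IllegalArgumentException("truncated segment header");
            }
            // Size is stored high byte first and includes its own two bytes.
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2 || pos + 2 + size > f.length) {
                throw new IllegalArgumentException("bad segment size at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                return types;                       // assumed: scan data follows, stop here
            }
            pos += 2 + size;
        }
    }
}
```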
A52 JPEG/Exif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 159482003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format – 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro fLC
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 InteractiveFiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SNM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b].
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
Figure 2 ILBM File hampic Displayed
<ILBM> → "FORM" <cksize type=UINT32> "ILBM" <BMHD> <CMAP> <CAMG> <BODY>
    ILBM.cksize = cksize.value
<BMHD> → "BMHD" <cksize type=UINT32> <BitMapHeader>
<BitMapHeader> → <width type=UINT16>
                 <height type=UINT16>
                 <xposition type=INT16>
                 <yposition type=INT16>
                 <nplanes type=BYTE>
                 <masking type=BYTE>
                 <compression type=BYTE>
                 <reserved type=BYTE>
                 <transparentcolor type=UINT16>
                 <xaspect type=BYTE>
                 <yaspect type=BYTE>
                 <pagewidth type=INT16>
                 <pageheight type=INT16>
<CMAP> → "CMAP" <cksize type=UINT32> <color>[n]    n = cksize.value / 3
<CAMG> → "CAMG" <cksize type=UINT32> <viewmode type=INT32>
<BODY> → "BODY" <cksize type=UINT32> <data type=BYTE>[cksize]
<color> → <red type=BYTE> <green type=BYTE> <blue type=BYTE>
Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format
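The semantic rule attached to the CMAP production derives the number of color registers from the chunk size. The Java sketch below is one way to evaluate that rule. Rejecting chunk sizes that are not a multiple of 3 is an assumption added here for validation and is not stated in the grammar; the class and method names are illustrative.

```java
// Illustrative sketch: evaluate the CMAP semantic rule n = cksize.value / 3.
// Class and method names are hypothetical.
final class CmapRule {
    /** Returns the number of 3-byte color registers in a CMAP chunk of the given size. */
    static int colorCount(long ckSize) {
        // Assumed validation rule: a CMAP chunk holds a whole number of RGB triples.
        if (ckSize < 0 || ckSize % 3 != 0) {
            throw new IllegalArgumentException("CMAP size is not a multiple of 3: " + ckSize);
        }
        return (int) (ckSize / 3);
    }
}
```

For a CMAP chunk of size 48, the rule yields 16 color registers.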
Figure 4 A Parse Tree for the ILBM File hampic
42 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' <data-ck> | <silence-ck>
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar, such as chunk size or image dimensions, and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
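The idea can be sketched as a small hand-written recognizer in which each procedure implements one production and the cksize attribute controls how much chunk data is consumed. The Java code below is a minimal sketch for a generic IFF form, not the parser described in this report. It assumes big-endian chunk sizes and one pad byte after odd-sized chunks; pointer handling with the pushdown stack is not shown; all names are illustrative.

```java
// Illustrative sketch: recursive descent recognizer for
//   form  -> "FORM" cksize formType chunk*
//   chunk -> ckId cksize BYTE[cksize]
// All names are hypothetical.
final class IffRecognizer {
    private final byte[] in;
    private int pos = 0;

    IffRecognizer(byte[] in) {
        this.in = in;
    }

    /** Returns true if the whole input conforms to the form production. */
    boolean recognize() {
        try {
            form();
            return pos == in.length;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // Semantic rule: the form type and the chunks occupy exactly cksize bytes.
    private void form() {
        expect("FORM");
        long ckSize = ulong();
        long end = pos + ckSize;
        id();                                  // formType, for example ILBM or 8SVX
        while (pos < end) {
            chunk();
        }
        if (pos != end) {
            fail("chunks do not match the FORM size");
        }
    }

    // The cksize attribute predicts how many data bytes follow.
    private void chunk() {
        id();
        long ckSize = ulong();
        skip(ckSize + (ckSize & 1));           // assumed: pad byte after odd-sized chunks
    }

    private String id() {
        need(4);
        String s = new String(in, pos, 4, java.nio.charset.StandardCharsets.ISO_8859_1);
        pos += 4;
        return s;
    }

    private void expect(String tag) {
        String s = id();
        if (!s.equals(tag)) {
            fail("expected " + tag + ", found " + s);
        }
    }

    private long ulong() {
        need(4);
        long v = ((in[pos] & 0xFFL) << 24) | ((in[pos + 1] & 0xFFL) << 16)
               | ((in[pos + 2] & 0xFFL) << 8) | (in[pos + 3] & 0xFFL);
        pos += 4;
        return v;
    }

    private void skip(long n) {
        need(n);
        pos += (int) n;
    }

    private void need(long n) {
        if (n > in.length - pos) {
            fail("unexpected end of file");
        }
    }

    private static void fail(String msg) {
        throw new IllegalStateException(msg);
    }
}
```

No lexer is involved: each procedure reads the bytes whose type the grammar rule predicts at that point.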
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
61 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
62 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer
that treats each input byte in a file as a character token. Then functions are created for each
data type in the binary file grammar that convert the appropriate number of character tokens to
binary data types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed
us to test our attribute grammars for binary file formats as well as to demonstrate the feasibility
of creating a parser generator for binary file grammars.
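The conversion functions can be illustrated independently of ANTLR. The Java sketch below combines byte values in the range 0 to 255 into 16-bit and 32-bit integers in either byte order. It is an illustration of the technique rather than the code used in the grammars of this report, and the class and method names are hypothetical. The 32-bit case widens to long before shifting so that a high byte of 0x80 or more does not produce a negative value.

```java
// Illustrative sketch: convert byte tokens (values 0..255) to binary data types.
// Class and method names are hypothetical.
final class ByteTokens {
    /** Unsigned 16-bit integer from two byte values. */
    static int uint16(int b1, int b2, boolean littleEndian) {
        return littleEndian ? (b2 << 8) | b1 : (b1 << 8) | b2;
    }

    /** Signed 16-bit integer: the cast to short sign-extends bit 15. */
    static int int16(int b1, int b2, boolean littleEndian) {
        return (short) uint16(b1, b2, littleEndian);
    }

    /** Unsigned 32-bit integer held in a long so that it never becomes negative. */
    static long uint32(int b1, int b2, int b3, int b4, boolean littleEndian) {
        return littleEndian
                ? ((long) b4 << 24) | (b3 << 16) | (b2 << 8) | b1
                : ((long) b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }
}
```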
621 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
  language = Java;               // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }
}
// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
  : FORM ckSize ILBM
    {
      // Print out ILBM chunk size
      System.out.println("ID ILBM Size " + $ckSize.value);
    }
    // Semantic predicate that ensures that at least one BMHD is found
    propertyChunks { $propertyChunks.foundBMHD == true }?
    dataChunks*
    body
    { $result = true; }
  ;

// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
  : ( bmhd { $foundBMHD = true; }
    | cmap
    | camg )+
  ;
bmhd
  : BMHD ckSize
    { System.out.println("ID BMHD Size " + $ckSize.value); }
    bitMapHeader
  ;

bitMapHeader
  : width = uWordValue
    height = uWordValue
    xPosition = wordValue
    yPosition = wordValue
    nPanes = uByteValue
    masking = uByteValue
    compression = uByteValue
    reserved1 = uByteValue
    transparentColor = uWordValue
    xAspect = uByteValue
    yAspect = uByteValue
    pageWidth = wordValue
    pageHeight = wordValue
    {
      // Print out values in the bitmap structure
      System.out.println("  Width " + $width.value);
      System.out.println("  Height " + $height.value);
      System.out.println("  X Position " + $xPosition.value);
      System.out.println("  Y Position " + $yPosition.value);
      System.out.println("  Number of Panes " + $nPanes.value);
      System.out.println("  Masking " + $masking.value);
      System.out.println("  Compression " + $compression.value);
      System.out.println("  Reserved1 " + $reserved1.value);
      System.out.println("  TransParent Color " +
          $transparentColor.value);
      System.out.println("  X Aspest " + $xAspect.value);
      System.out.println("  Y Aspect " + $yAspect.value);
      System.out.println("  Page Width " + $pageWidth.value);
      System.out.println("  Page Height " + $pageHeight.value);
    }
  ;
// The index value i is initialized in the @init section of the cmap rule
cmap returns [long value]
@init { int i = 0; }
  : CMAP ckSize
    {
      // Print out chunk size
      $value = $ckSize.value;
      System.out.println("ID CMAP Size " + $ckSize.value);
    }
    ( colorRegister
      {
        // Print out color values
        System.out.println("  Color[" + i++ + "] " +
            $colorRegister.redValue + " " +
            $colorRegister.greenValue + " " +
            $colorRegister.blueValue);
      }
    )+
  ;

colorRegister returns [int redValue, int greenValue, int blueValue]
  : red = uByteValue green = uByteValue blue = uByteValue
    {
      $redValue = $red.value;
      $greenValue = $green.value;
      $blueValue = $blue.value;
    }
  ;

camg
  : CAMG ckSize longValue
    {
      System.out.println("ID CAMG Size " + $ckSize.value);
      System.out.println("  View Modes 0x" +
          Long.toHexString($longValue.value));
    }
  ;
// The crng and ccrt data chunks
dataChunks
  : crng
  | ccrt
  ;

crng
  : CRNG ckSize
    { System.out.println("ID CRNG Size " + $ckSize.value); }
    cRange
  ;

cRange
  : pad1 = wordValue { $pad1.value == 0 }?
    rate = wordValue
    active = wordValue
    low = uByteValue
    high = uByteValue
  ;

ccrt
  : CCRT ckSize
    { System.out.println("ID CCRT Size " + $ckSize.value); }
    cycleInfo
  ;

cycleInfo
  : direction = wordValue
    start = uByteValue
    end = uByteValue
    seconds = longValue
    microseconds = longValue
    pad = wordValue { $pad.value == 0 }?
  ;

body
  : BODY ckSize
    { System.out.println("ID BODY Size " + $ckSize.value); }
    { $ckSize.value > 0 }? data
    { System.out.println("  Number of bytes " + $data.value); }
    { $data.value == $ckSize.value }?
  ;

// Defines the data chunk of the body chunk
data returns [long value]
  : ( b += BYTE )+ { $b.size() > 0 }?
    { $value = $b.size(); }
  ;
// The ckSize is the value of a long integer
ckSize returns [long value]
  : longValue
    { $value = $longValue.value; }
  ;

// Defines the longValue data type in terms of BYTE terminals.
// The high byte is widened to long before shifting so that values of
// 0x80000000 and above are not sign-extended.
longValue returns [long value] throws UnsupportedEncodingException
  : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
    {
      $value = ((long) getUByteValue($h1) << 24)
             | (getUByteValue($h2) << 16)
             | (getUByteValue($l1) << 8)
             |  getUByteValue($l2);
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Defines the unsigned word data type
uWordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (int)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2));
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (short)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2));
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Defines the unsigned byte data type
uByteValue returns [int value]
  : BYTE
    {
      // Calls getUByteValue on the BYTE token
      $value = (int) getUByteValue($BYTE);
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : '\u0000'..'\u00FF' ;
Figure 6 An ANTLR Grammar for the ILBM File Format
622 The Lexer and Parser
When provided with the grammar file described in the previous section, ANTLR generates the
iblmLexer and iblmParser Java programs. The Java application shown in Figure 7 uses these
programs to validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;

import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:/Users/sl170/Pictures/IFF-ilbm/BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that each byte becomes one character token
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
FILE C:/Users/sl170/Pictures/IFF-ilbm/BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
63 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
  language = Java;               // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }

  // Default byte order is little endian
  String byteOrder = "II";
}
// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
  : ckId = fourCC { $ckId.value.equals("RIFF") }?
    ckSize = dWordValue
    formType = fourCC { $formType.value.equals("WAVE") }?
    formatSubChunk
    dataSubChunk
    { $result = true; }
  ;

fourCC returns [String value]
  : BYTE BYTE BYTE BYTE { $value = $text; }
  ;

formatSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
    ckSize = dWordValue
    commonFields
    pcmSpecificField
  ;

commonFields
  : formatTag = wordValue          // Format category
    channels = wordValue           // Number of channels
    samplesPerSec = dWordValue     // Sampling rate
    avgBytesPerSec = dWordValue    // For buffer estimation
    blockAlign = wordValue         // Data block size
  ;

pcmSpecificField
  : bitsPerSample = wordValue      // Sample size
  ;

dataSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("data") }?
    ckSize = dWordValue
    waveData[$ckSize.value]
  ;

// Consumes one BYTE per remaining unit of size
waveData [long size]
  : ( { size > 0 }?=> BYTE { size--; } )+
  ;
defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
if (byteOrderequals(II))
$value = (long)
( getUByteValue($b1) |
getUByteValue($b2) ltlt 8 |
getUByteValue($b3) ltlt 16 |
getUByteValue($b4) ltlt 24 )
else
$value = (long)
( getUByteValue($b1) ltlt 24 |
getUByteValue($b2) ltlt 16 |
20
getUByteValue($b3) ltlt 8 |
getUByteValue($b4) )
catch[RecognitionException re]
reportError(re)
recover(inputre)
catch[UnsupportedEncodingException uee]
Systemoutprintln(Unsupported Encoding Exception)
defines the signed word data type
wordValue returns [int value]
b1 = BYTE b2 = BYTE
if (byteOrderequals(II))
$value = (int)
( getUByteValue($b1) |
getUByteValue($b2) ltlt 8 )
else
$value = (int)
( getUByteValue($b1) ltlt 8 |
getUByteValue($b2) )
catch[RecognitionException re]
reportError(re)
recover(inputre)
catch[UnsupportedEncodingException uee]
Systemoutprintln(Unsupported Encoding Exception)
Terminal rule used to create the Lexer
BYTE
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
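The byte-order handling in the dWordValue and wordValue rules of Figure 9 can be tried outside ANTLR. The following is a minimal Java sketch, not the generated parser; the class name ByteOrderSketch and its method names are hypothetical. It assumes that "II" in the grammar denotes little-endian order. Unlike the grammar action, it widens each byte to long before shifting, so that values of 2^31 and above stay non-negative (in Java an int shifted left by 24 can become negative before the cast to long is applied).

```java
/** Hypothetical helper for illustration; not part of the ANTLR-generated parser. */
final class ByteOrderSketch {

    /** Combines four byte values into an unsigned 32-bit quantity held in a long. */
    static long dWord(int b1, int b2, int b3, int b4, boolean littleEndian) {
        // Mask each value to 0..255 and widen to long before shifting.
        long u1 = b1 & 0xFF, u2 = b2 & 0xFF, u3 = b3 & 0xFF, u4 = b4 & 0xFF;
        return littleEndian
                ? u1 | (u2 << 8) | (u3 << 16) | (u4 << 24)
                : (u1 << 24) | (u2 << 16) | (u3 << 8) | u4;
    }

    /** Combines two byte values into an unsigned 16-bit quantity held in an int. */
    static int word(int b1, int b2, boolean littleEndian) {
        int u1 = b1 & 0xFF, u2 = b2 & 0xFF;
        return littleEndian ? (u1 | (u2 << 8)) : ((u1 << 8) | u2);
    }
}
```

For example, the little-endian bytes 0x90 0x09 0x00 0x00 decode to 2448, the BODY size reported in Figure 8 (which, being an IFF file, stores the same value in big-endian order).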
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewer/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
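The sample count described above is a geometric sum, so it can be checked with a few lines of code. The following Java sketch is illustrative only; the class name Svx8BodySize and its methods are hypothetical, and octave index 0 is assumed to denote the highest-frequency octave. Summing the per-octave counts gives (2^ctOctave - 1) x (oneShotHiSamples + repeatHiSamples).

```java
/** Hypothetical helper for illustration; names are not taken from the 8SVX specification. */
final class Svx8BodySize {

    /** Samples in octave k, where k = 0 is the highest-frequency octave. */
    static long samplesInOctave(int k, long oneShotHiSamples, long repeatHiSamples) {
        // Each successive octave holds twice as many samples as the one before it.
        return (1L << k) * (oneShotHiSamples + repeatHiSamples);
    }

    /** Total samples in the BODY chunk: the sum over all ctOctave octaves. */
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        long total = 0;
        for (int k = 0; k < ctOctave; k++) {
            total += samplesInOctave(k, oneShotHiSamples, repeatHiSamples);
        }
        return total;
    }
}
```

A validator could compare this total with the cksize of the BODY chunk when sCompression indicates uncompressed data; that comparison is a design suggestion, not something the specification text above prescribes.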
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
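The <headerchunk> production can be recognized with a short hand-written routine. The following Java sketch is only an illustration of that production; the class name MidiHeaderSketch and the method parseHeader are hypothetical. It relies on the fact that Standard MIDI Files store multi-byte integers most significant byte first, and it checks the header length of 6 stated above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical recognizer for the <headerchunk> production; illustration only. */
final class MidiHeaderSketch {

    /** Returns {format, ntrks, division} or throws if the header chunk is malformed. */
    static int[] parseHeader(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("too short for an MThd chunk");
        }
        // Standard MIDI Files are big-endian.
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN);
        byte[] id = new byte[4];
        buf.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MThd chunk id");
        }
        long length = buf.getInt() & 0xFFFFFFFFL;   // <length of header data>
        if (length != 6) {
            throw new IllegalArgumentException("header length is " + length + ", expected 6");
        }
        int format = buf.getShort() & 0xFFFF;       // <format>
        int ntrks = buf.getShort() & 0xFFFF;        // <ntrks>
        int division = buf.getShort() & 0xFFFF;     // <division>
        return new int[] {format, ntrks, division};
    }
}
```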
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange file.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
    marker
    count
    text
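The outer structure of the grammar above ("FORM" cksize form type, followed by local chunks that each begin with a four-character ID and a size) can be walked without interpreting the chunk contents. The following Java sketch lists the local chunk IDs of a FORM file; the class name IffChunkLister and its methods are hypothetical. It assumes the EA IFF 85 conventions that sizes are 32-bit big-endian values and that a chunk with an odd size is followed by one pad byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunk walker for FORM files; illustration only. */
final class IffChunkLister {

    /** Reads a four-character chunk ID at the given offset. */
    static String tag(byte[] f, int pos) {
        return new String(f, pos, 4, StandardCharsets.US_ASCII);
    }

    /** Reads an unsigned 32-bit big-endian integer at the given offset. */
    static long u32be(byte[] f, int pos) {
        return ((f[pos] & 0xFFL) << 24) | ((f[pos + 1] & 0xFFL) << 16)
             | ((f[pos + 2] & 0xFFL) << 8) | (f[pos + 3] & 0xFFL);
    }

    /** Returns the IDs of the local chunks inside a FORM of the given type. */
    static List<String> localChunkIds(byte[] f, String formType) {
        if (f.length < 12 || !tag(f, 0).equals("FORM") || !tag(f, 8).equals(formType)) {
            throw new IllegalArgumentException("not a FORM " + formType + " file");
        }
        // The FORM cksize counts the bytes that follow the size field itself.
        long end = Math.min((long) f.length, 8L + u32be(f, 4));
        List<String> ids = new ArrayList<>();
        long pos = 12;
        while (pos + 8 <= end) {
            ids.add(tag(f, (int) pos));
            long size = u32be(f, (int) pos + 4);
            pos += 8 + size + (size & 1);   // skip data plus the pad byte for odd sizes
        }
        return ids;
    }
}
```

A validator built on this walker could then check that the required COMM chunk is present and dispatch each ID to a chunk-specific rule.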
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
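The fixed offsets in this layout lend themselves to a simple signature-and-size check. The following Java sketch tests the three tags and confirms that both size fields fit within the file; the class name WebPHeaderSketch and its methods are hypothetical. It assumes, as for other RIFF files, that the size fields are 32-bit little-endian values, and that the fourth byte of the "VP8 " tag is a space.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical header check for the WebP layout shown above; illustration only. */
final class WebPHeaderSketch {

    /** True if the four bytes at pos spell the given tag. */
    static boolean tagAt(byte[] f, int pos, String tag) {
        return new String(f, pos, 4, StandardCharsets.US_ASCII).equals(tag);
    }

    /** Reads an unsigned 32-bit little-endian integer at the given offset. */
    static long u32le(byte[] f, int pos) {
        return (f[pos] & 0xFFL) | ((f[pos + 1] & 0xFFL) << 8)
             | ((f[pos + 2] & 0xFFL) << 16) | ((f[pos + 3] & 0xFFL) << 24);
    }

    /** Checks the tags at offsets 0, 8 and 12 and that both sizes fit in the file. */
    static boolean looksLikeWebP(byte[] f) {
        if (f.length < 20) {
            return false;
        }
        if (!tagAt(f, 0, "RIFF") || !tagAt(f, 8, "WEBP") || !tagAt(f, 12, "VP8 ")) {
            return false;
        }
        long riffSize = u32le(f, 4);    // bytes following offset 8
        long vp8Size = u32le(f, 16);    // bytes following offset 20
        return riffSize <= f.length - 8L && vp8Size <= f.length - 20L;
    }
}
```

This checks only the container; validating the VP8 image data itself is outside the scope of the sketch.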
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff      identifies segment
    n        type of segment (one byte)
    sh, sl   size of the segment, including these two bytes but not
             including the $ff and the type byte. Note: not Intel order;
             high byte first, low byte last
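The segment format above is enough to step from one segment header to the next. The following Java sketch collects the type bytes of the segments that follow the SOI marker; the class name JpegSegmentSketch and its method are hypothetical. As a simplifying assumption it stops at the first SOS segment (type $da), because entropy-coded data follows it, and it does not handle markers that carry no size field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical segment walker for the JPEG/JFIF layout above; illustration only. */
final class JpegSegmentSketch {

    /** Returns the type bytes of the segments after SOI, up to and including SOS. */
    static List<Integer> segmentTypes(byte[] f) {
        if (f.length < 4 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI marker");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= f.length) {
            if ((f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                break;                       // EOI trailer
            }
            // Size is high byte first and counts the two size bytes themselves.
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2 || pos + 2 + size > f.length) {
                throw new IllegalArgumentException("bad segment size at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                break;                       // SOS: compressed image data follows
            }
            pos += 2 + size;                 // skip $ff, the type byte, and the segment
        }
        return types;
    }
}
```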
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
40
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
Figure 4. A Parse Tree for the ILBM File hampic
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> 'RIFF' cksize 'WAVE'
               <fmt-ck>
               [<fact-ck>]
               [<cue-ck>]
               [<playlist-ck>]
               [<assoc-data-list>]
               <wave-data>
<fmt-ck> -> 'fmt ' cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> 'data' cksize <wave-data>
<wave-list> -> 'LIST' cksize 'wavl' <data-ck> | <silence-ck>
<silence-ck> -> 'slnt' cksize <dwSamples:DWORD>
<fact-ck> -> 'fact' cksize <dwFileSize:DWORD>
<cue-ck> -> 'cue ' cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
               <dwPosition:DWORD>
               <fccChunk:FOURCC>
               <dwChunkStart:DWORD>
               <dwBlockStart:DWORD>
               <dwSampleOffset:DWORD>
<playlist-ck> -> 'plst' cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                  <dwLength:DWORD>
                  <dwLoops:DWORD>
<assoc-data-list> -> 'LIST' cksize 'adtl' <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> 'labl' cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> 'note' cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
             <dwSampleLength:DWORD>
             <dwPurpose:DWORD>
             <wCountry:WORD>
             <wLanguage:WORD>
             <wDialect:WORD>
             <wCodePage:WORD>
             <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
             <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
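As a minimal sketch of this idea, and not the parser described in this paper, the following Java class recognizes the rule form -> 'FORM' cksize 'ILBM' chunk* with one procedure per production. The class and method names are hypothetical. The four-byte chunk ID is the only lookahead used to choose a production, and each data type is read directly because the rule being executed predicts it. Only the BMHD chunk is given its own procedure; every other chunk is skipped by its size.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical predictive recognizer for: form -> 'FORM' cksize 'ILBM' chunk* */
final class IffRecognizer {
    private final ByteBuffer in; // ByteBuffer is big-endian by default, as IFF requires

    IffRecognizer(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** form -> 'FORM' cksize 'ILBM' chunk* ; returns true if the file conforms. */
    boolean form() {
        try {
            if (!id().equals("FORM")) return false;
            long size = cksize();
            long end = in.position() + size;
            if (size < 4 || end > in.limit()) return false;
            if (!id().equals("ILBM")) return false;
            while (in.position() < end) {
                chunk((int) end);
            }
            return in.position() == end;
        } catch (RuntimeException e) {
            return false; // buffer underflow or a failed check in a sub-procedure
        }
    }

    /** chunk -> id cksize data [pad] ; the id alone selects the production. */
    private void chunk(int end) {
        String id = id();
        long size = cksize();
        if (size > end - in.position()) {
            throw new IllegalStateException("chunk " + id + " overruns the FORM");
        }
        switch (id) {
            case "BMHD":
                bmhd(size);
                break;
            default:
                in.position(in.position() + (int) size); // unknown chunk: skip its data
        }
        if ((size & 1) == 1) in.get(); // IFF chunks are padded to an even length
    }

    /** bmhd -> UWORD UWORD ... ; only the size is checked in this sketch. */
    private void bmhd(long size) {
        if (size != 20) throw new IllegalStateException("BMHD must be 20 bytes");
        in.getShort(); // width
        in.getShort(); // height
        in.position(in.position() + 16); // remaining header fields
    }

    private String id() {
        byte[] b = new byte[4];
        in.get(b);
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    private long cksize() {
        return in.getInt() & 0xffffffffL; // unsigned 32-bit, high byte first
    }
}
```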
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check context-sensitive aspects of the grammar, such as chunk size or image dimensions, and to
use these values in parsing the chunk data or images. The pointers to file addresses that occur in
the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing the production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
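The push, jump, and pop sequence can be sketched as follows. The layout parsed here is a toy format invented for the illustration (a 4-byte big-endian pointer followed by a 1-byte trailer, with the pointer leading to a 1-byte count n and n data bytes); it is not one of the formats in this paper, and the class name PointerFollowing is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of following a file pointer with a pushdown stack of addresses. */
final class PointerFollowing {

    /**
     * Toy layout (an assumption): [u32 pointer][u8 trailer], pointer -> [u8 n][n bytes].
     * Returns {n, trailer}.
     */
    static int[] parse(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        Deque<Integer> returnAddresses = new ArrayDeque<>();

        int target = in.getInt(); // the pointer field
        if (target < 0 || target >= file.length) {
            throw new IllegalArgumentException("pointer outside the file");
        }
        returnAddresses.push(in.position()); // remember where to resume
        in.position(target);                 // jump to the pointed-to location

        // Procedures for the pointed-to structure.
        int n = in.get() & 0xff;
        in.position(in.position() + n);      // consume the n data bytes

        // Those procedures are exhausted: pop the address and resume there.
        in.position(returnAddresses.pop());
        int trailer = in.get() & 0xff;
        return new int[] {n, trailer};
    }
}
```

With nested pointers the same stack holds one return address per pointer that is still being followed, which is why a stack rather than a single saved address is used.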
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-
based and directory-based binary file formats. This is accomplished by writing a lexer that treats
each input byte in a file as a character token. Then functions are created for each data type in the
binary file grammar that convert the appropriate number of character tokens to binary data types,
e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
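The conversion functions amount to shifting and combining byte values. The following Java sketch shows the idea in isolation; the class name BinaryTypes and its method names are hypothetical, and the functions take byte values as int arguments rather than ANTLR tokens. Masking each byte with 0xff and widening to long before the 24-bit shift avoids sign extension when the high byte is 0x80 or greater.

```java
/** Hypothetical helpers that assemble binary data types from individual byte values. */
final class BinaryTypes {

    /** Unsigned 16-bit integer, high byte first. */
    static int uint16be(int b1, int b2) {
        return ((b1 & 0xff) << 8) | (b2 & 0xff);
    }

    /** Signed 16-bit integer, high byte first. */
    static int int16be(int b1, int b2) {
        return (short) uint16be(b1, b2);
    }

    /** Unsigned 32-bit integer, high byte first (as in IFF). */
    static long uint32be(int b1, int b2, int b3, int b4) {
        return ((long) (b1 & 0xff) << 24)
                | ((b2 & 0xff) << 16)
                | ((b3 & 0xff) << 8)
                | (b4 & 0xff);
    }

    /** Unsigned 32-bit integer, low byte first (as in RIFF). */
    static long uint32le(int b1, int b2, int b3, int b4) {
        return uint32be(b4, b3, b2, b1);
    }
}
```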
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}
// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      // Print out ILBM chunk size
      { System.out.println("ID ILBM Size " + $ckSize.value); }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks
      body
      { $result = true; }
    ;

// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      {
        // Print out values in the bitmap structure
        System.out.println(" Width " + $width.value);
        System.out.println(" Height " + $height.value);
        System.out.println(" X Position " + $xPosition.value);
        System.out.println(" Y Position " + $yPosition.value);
        System.out.println(" Number of Panes " + $nPanes.value);
        System.out.println(" Masking " + $masking.value);
        System.out.println(" Compression " + $compression.value);
        System.out.println(" Reserved1 " + $reserved1.value);
        System.out.println(" TransParent Color " +
            $transparentColor.value);
        System.out.println(" X Aspest " + $xAspect.value);
        System.out.println(" Y Aspect " + $yAspect.value);
        System.out.println(" Page Width " + $pageWidth.value);
        System.out.println(" Page Height " + $pageHeight.value);
      }
    ;
cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
    : CMAP ckSize
      {
        // Print out chunk size
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        {
          // Print out color values
          System.out.println(" Color[" + i++ + "] " +
              $colorRegister.redValue +
              " " + $colorRegister.greenValue +
              " " + $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println(" View Modes 0x" +
            Long.toHexString($longValue.value));
      }
    ;

// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// Defines the data chunk of the body chunk
data returns [long value]
    : (b += BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;
// The ckSize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        // Widen the high byte to long before shifting so that a value of
        // 0x80 or greater is not sign-extended
        $value = ((long) getUByteValue($h1) << 24) |
                 (getUByteValue($h2) << 16) |
                 (getUByteValue($l1) << 8) |
                 getUByteValue($l2);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      // Calls getUByteValue on the BYTE token
      { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6. An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that each byte becomes one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
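The sizes reported in this output are mutually consistent, which is the kind of context-sensitive property a semantic rule can check. The FORM size counts the 4-byte form type plus, for each chunk, an 8-byte header (ID and cksize) and the chunk data: 4 + (8 + 20) + (8 + 48) + (8 + 4) + (8 + 2448) = 2556. The Java sketch below expresses that sum; the class name FormSizeCheck is hypothetical, and this check is not part of the grammar in Figure 6.

```java
/** Hypothetical consistency check relating a FORM size to the sizes of its chunks. */
final class FormSizeCheck {

    /**
     * Expected FORM size: 4 bytes for the form type, then for each chunk an
     * 8-byte header, the chunk data, and one pad byte if the data length is odd.
     */
    static long expectedFormSize(long... chunkSizes) {
        long total = 4;
        for (long size : chunkSizes) {
            total += 8 + size + (size & 1);
        }
        return total;
    }
}
```

For the BMHD, CMAP, CAMG and BODY sizes shown above, expectedFormSize(20, 48, 4, 2448) yields 2556, the ILBM size on the second line of the output.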
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default byte order is little endian
    String byteOrder = "II";
}
// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue         // Format category
      channels = wordValue          // Number of channels
      samplesPerSec = dWordValue    // Sampling rate
      avgBytesPerSec = dWordValue   // For buffer estimation
      blockAlign = wordValue        // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue     // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ( { size >= 0 }?=> BYTE { size--; } )+
    ;
// Defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        // The byte shifted by 24 is widened to long first so that a value
        // of 0x80 or greater is not sign-extended
        if (byteOrder.equals("II")) {
            $value = getUByteValue($b1) |
                     (getUByteValue($b2) << 8) |
                     (getUByteValue($b3) << 16) |
                     ((long) getUByteValue($b4) << 24);
        } else {
            $value = ((long) getUByteValue($b1) << 24) |
                     (getUByteValue($b2) << 16) |
                     (getUByteValue($b3) << 8) |
                     getUByteValue($b4);
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the word (16-bit) data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer
BYTE : . ;
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
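For comparison, the same sequence of fields can be read without a lexer when the data types are predicted directly from the rule being executed. The Java sketch below is an illustration only and is not generated by ANTLR: the class name WaveHeader is hypothetical, and it assumes the simplest layout accepted by the grammar in Figure 9, namely a 16-byte fmt chunk with format tag 0x0001 followed immediately by the data chunk.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader for the fields of a minimal RIFF WAVE PCM file. */
final class WaveHeader {
    final int channels;
    final long samplesPerSec;
    final int bitsPerSample;
    final long dataSize;

    private WaveHeader(int channels, long samplesPerSec, int bitsPerSample, long dataSize) {
        this.channels = channels;
        this.samplesPerSec = samplesPerSec;
        this.bitsPerSample = bitsPerSample;
        this.dataSize = dataSize;
    }

    static WaveHeader parse(byte[] file) {
        // RIFF stores multi-byte integers low byte first
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        expect(in, "RIFF");
        in.getInt(); // ckSize of the RIFF form (not checked in this sketch)
        expect(in, "WAVE");
        expect(in, "fmt ");
        long fmtSize = in.getInt() & 0xffffffffL;
        if (fmtSize != 16) throw new IllegalArgumentException("unexpected fmt size " + fmtSize);
        int formatTag = in.getShort() & 0xffff;
        if (formatTag != 0x0001) throw new IllegalArgumentException("not PCM");
        int channels = in.getShort() & 0xffff;
        long samplesPerSec = in.getInt() & 0xffffffffL;
        in.getInt();   // dwAvgBytesPerSec
        in.getShort(); // wBlockAlign
        int bitsPerSample = in.getShort() & 0xffff;
        expect(in, "data");
        long dataSize = in.getInt() & 0xffffffffL;
        if (dataSize > in.remaining()) throw new IllegalArgumentException("data chunk overruns file");
        return new WaveHeader(channels, samplesPerSec, bitsPerSample, dataSize);
    }

    /** Consumes four bytes and checks them against the expected four-character code. */
    private static void expect(ByteBuffer in, String fourCC) {
        byte[] b = new byte[4];
        in.get(b);
        if (!new String(b, StandardCharsets.ISO_8859_1).equals(fourCC)) {
            throw new IllegalArgumentException("expected '" + fourCC + "'");
        }
    }
}
```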
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe non-
XML physical representations. Version 1 of the language specification was published in 2011
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG and WAVE, as well as the directory-
based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the context-
free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of chunk-
based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
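For chunk-based formats a signature test typically inspects the container ID in bytes 0 to 3 and the form type in bytes 8 to 11. The Java sketch below illustrates such a test for three formats mentioned in this report. It is an assumption-laden illustration rather than the magic file cited above: the class name ChunkMagic and the returned labels are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical signature test for a few chunk-based formats. */
final class ChunkMagic {

    /** Identifies a file from its first 12 bytes; returns "unknown" if no test matches. */
    static String identify(byte[] head) {
        if (head.length < 12) {
            return "unknown";
        }
        // Bytes 0-3 hold the container ID, bytes 4-7 the size, bytes 8-11 the form type.
        String container = new String(head, 0, 4, StandardCharsets.ISO_8859_1);
        String formType = new String(head, 8, 4, StandardCharsets.ISO_8859_1);
        if (container.equals("FORM") && formType.equals("ILBM")) return "IFF ILBM";
        if (container.equals("FORM") && formType.equals("AIFF")) return "IFF AIFF";
        if (container.equals("RIFF") && formType.equals("WAVE")) return "RIFF WAVE";
        return "unknown";
    }
}
```

A signature match only selects which grammar to apply; validation still requires parsing the whole file against that grammar.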
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A. V. Aho, Indexed Grammars – An Extension of Context-free Grammars, Journal
of the ACM, Vol. 15, Issue 4, Oct. 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D. Baum & G. Noel, The Sims Interchange File Format, 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library, JHOVE2 User's Guide, 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari, Caligari trueSpace file format, Document Version 6, July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M, Rankin S, Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS: A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft, Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISO/IEC 15948:2003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Products&product=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTK&button=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward and S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=okt&button=GO21
Appendix A. Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
Fixed → LONG
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1}\,(\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathit{ctOctave}-1})\,(\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
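The sample count above can be turned into a small check that a validator might apply to the BODY chunk. The following Java fragment is a minimal sketch, not part of the original specification: the class and method names are hypothetical, and it assumes uncompressed data (sCompression = 0), so that one sample occupies one byte.

```java
/** Sketch: expected number of samples in an 8SVX BODY chunk (hypothetical helper). */
final class Svx8BodySize {
    /**
     * Sums the samples of all octaves: octave k (k = 0 is the highest
     * frequency) holds 2^k * (oneShotHiSamples + repeatHiSamples) samples.
     */
    static long bodySamples(long oneShotHiSamples, long repeatHiSamples, int ctOctave) {
        long highestOctave = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        for (int k = 0; k < ctOctave; k++) {
            total += (1L << k) * highestOctave;
        }
        return total;
    }
}
```

A recognizer could compare `Svx8BodySize.bodySamples(...)`, computed from the VHDR fields, with the cksize of the BODY chunk and report an error when they disagree.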
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
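As an illustration of how the first two rules can be checked, the following Java fragment is a minimal sketch of a procedure for <headerchunk> and <header data>. The class and method names are hypothetical; the sketch assumes the whole file is available as a byte array and relies on the fact that MIDI files store integers high byte first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of a procedure for the <headerchunk> rule (hypothetical names). */
final class MidiHeaderSketch {
    /** Returns {format, ntrks, division}, or throws if the header chunk is malformed. */
    static int[] readHeaderChunk(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("file too short for a header chunk");
        }
        ByteBuffer in = ByteBuffer.wrap(file);       // ByteBuffer is big-endian by default
        byte[] id = new byte[4];
        in.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MThd chunk type");
        }
        int length = in.getInt();                    // <length of header data>
        if (length != 6) {
            throw new IllegalArgumentException("header length is " + length + ", expected 6");
        }
        int format = in.getShort() & 0xFFFF;         // three 16-bit integers
        int ntrks = in.getShort() & 0xFFFF;
        int division = in.getShort() & 0xFFFF;
        return new int[] {format, ntrks, division};
    }
}
```

The returned ntrks value is the kind of attribute a binary file attribute grammar would use to decide how many <trackchunk> instances to expect.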
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
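The layout above can be checked with a few fixed-offset reads. The following Java fragment is a minimal sketch under stated assumptions: the class and method names are hypothetical, the whole file is held in a byte array, and the two size fields are read as little-endian 32-bit integers, as is usual for RIFF containers. It only tests that the declared sizes fit inside the file; it does not decode the VP8 data.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch: check the fixed WebP/RIFF layout listed above (hypothetical names). */
final class WebPHeaderSketch {
    /** Reads the 4-byte ASCII tag stored at the given offset. */
    static String tag(ByteBuffer buf, int offset) {
        byte[] t = new byte[4];
        for (int i = 0; i < 4; i++) {
            t[i] = buf.get(offset + i);
        }
        return new String(t, StandardCharsets.US_ASCII);
    }

    /** Returns the size of the VP8 image data, or -1 if the layout does not match. */
    static long vp8PayloadSize(byte[] file) {
        if (file.length < 20) return -1;
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (!tag(buf, 0).equals("RIFF")) return -1;
        long riffSize = buf.getInt(4) & 0xFFFFFFFFL;   // data starting at offset 8
        if (!tag(buf, 8).equals("WEBP")) return -1;
        if (!tag(buf, 12).equals("VP8 ")) return -1;
        long vp8Size = buf.getInt(16) & 0xFFFFFFFFL;   // data starting at offset 20
        // Context-sensitive checks: both declared sizes must fit in the file.
        if (8 + riffSize > file.length) return -1;
        if (20 + vp8Size > file.length) return -1;
        return vp8Size;
    }
}
```

The two size comparisons play the role of semantic rules attached to the cksize attributes of the grammar.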
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
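The segment structure described above can be walked with a simple loop. The following Java fragment is a minimal sketch, with hypothetical class and method names, that lists the type bytes of the segments following the SOI header. It assumes every segment carries the two size bytes, and it stops at a start-of-scan segment (type $da), because compressed image data rather than another segment header follows that segment.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: list the segment types of a JPEG/JFIF file (hypothetical names). */
final class JpegSegmentSketch {
    static List<Integer> segmentTypes(byte[] f) {
        if (f.length < 4 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 1 < f.length) {
            if ((f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                return types;                    // EOI trailer reached
            }
            if (pos + 3 >= f.length) {
                throw new IllegalArgumentException("truncated segment header");
            }
            // sh, sl: high byte first; the size includes these two bytes.
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2 || pos + 2 + size > f.length) {
                throw new IllegalArgumentException("bad segment size at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                return types;                    // start of scan: stop the walk here
            }
            pos += 2 + size;                     // skip $ff, type byte is counted in the 2
        }
        throw new IllegalArgumentException("missing EOI trailer");
    }
}
```

A full validator would continue past the start-of-scan segment by scanning the compressed data for the EOI marker; that step is outside this sketch.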
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The Waveform PCM file format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF
rules with the cksize inserted. To convert this grammar into a binary file format grammar, data
types are added for primitive non-terminals.
<WAVE-form> -> "RIFF" cksize "WAVE"
                   <fmt-ck>
                   [<fact-ck>]
                   [<cue-ck>]
                   [<playlist-ck>]
                   [<assoc-data-list>]
                   <wave-data>
<fmt-ck> -> "fmt " cksize <common-fields> <wBitsPerSample:WORD>
<common-fields> -> <wFormatTag:WORD value=0x0001>
                   <wChannels:WORD>
                   <dwSamplesPerSec:DWORD>
                   <dwAvgBytesPerSec:DWORD>
                   <wBlockAlign:WORD>
<wave-data> -> <data-ck> | <wave-list>
<data-ck> -> "data" cksize <wave-data>
<wave-list> -> "LIST" cksize "wavl" <data-ck> | <silence-ck>
<silence-ck> -> "slnt" cksize <dwSamples:DWORD>
<fact-ck> -> "fact" cksize <dwFileSize:DWORD>
<cue-ck> -> "cue " cksize <dwCuePoints:DWORD> <cue-point>+
<cue-point> -> <dwName:DWORD>
                   <dwPosition:DWORD>
                   <fccChunk:FOURCC>
                   <dwChunkStart:DWORD>
                   <dwBlockStart:DWORD>
                   <dwSampleOffset:DWORD>
<playlist-ck> -> "plst" cksize <dwSegments:DWORD> <play-segment>+
<play-segment> -> <dwName:DWORD>
                   <dwLength:DWORD>
                   <dwLoops:DWORD>
<assoc-data-list> -> "LIST" cksize "adtl" <labl-ck> <note-ck> <ltxt-ck> <file-ck>
<labl-ck> -> "labl" cksize <dwName:DWORD> <data:ZSTR>
<note-ck> -> "note" cksize <dwName:DWORD> <data:ZSTR>
<ltxt-ck> -> "ltxt" cksize <dwName:DWORD>
                   <dwSampleLength:DWORD>
                   <dwPurpose:DWORD>
                   <wCountry:WORD>
                   <wLanguage:WORD>
                   <wDialect:WORD>
                   <wCodePage:WORD>
                   <data:BYTE>+
<file-ck> -> "file" cksize <dwName:DWORD> <dwMedType:DWORD>
                   <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check context-sensitive aspects of the grammar, such as chunk size or image dimensions, and to
use these values in parsing the chunk data or images. The pointers to file addresses that occur in
the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing the production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
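The idea of one procedure per production, with semantic checks on chunk sizes, can be sketched for a deliberately small grammar: form -> "FORM" cksize formType chunk*, and chunk -> ckID cksize BYTE[cksize] followed by a pad byte when cksize is odd. The Java fragment below is an illustration under these assumptions and is not the parser described in this report; all class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of a recursive descent recognizer for a simplified IFF FORM (hypothetical names). */
final class IffRecognizerSketch {
    private final ByteBuffer in;                 // big-endian by default, as in IFF

    private IffRecognizerSketch(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** Returns true if the whole file matches the simplified grammar. */
    static boolean recognize(byte[] file) {
        try {
            IffRecognizerSketch p = new IffRecognizerSketch(file);
            p.form();
            return !p.in.hasRemaining();
        } catch (RuntimeException e) {           // syntax error or premature end of file
            return false;
        }
    }

    // form -> "FORM" cksize formType chunk*
    private void form() {
        if (!tag().equals("FORM")) throw new IllegalStateException("expected FORM");
        int end = endOfChunk(ulong());
        tag();                                   // formType, for example ILBM or 8SVX
        while (in.position() < end) {
            chunk(end);
        }
    }

    // chunk -> ckID cksize BYTE[cksize] [pad]
    private void chunk(int limit) {
        tag();                                   // ckID
        long cksize = ulong();
        int end = endOfChunk(cksize);
        if (end > limit) throw new IllegalStateException("chunk overruns its FORM");
        in.position(end);                        // skip the chunk data
        if ((cksize & 1) == 1 && in.position() < limit) {
            in.get();                            // odd-sized chunks are padded to even length
        }
    }

    // Semantic rule: a chunk may not extend past the end of the file.
    private int endOfChunk(long cksize) {
        long end = in.position() + cksize;
        if (end > in.limit()) throw new IllegalStateException("chunk overruns file");
        return (int) end;
    }

    private long ulong() {
        return in.getInt() & 0xFFFFFFFFL;
    }

    private String tag() {
        byte[] t = new byte[4];
        in.get(t);
        return new String(t, StandardCharsets.US_ASCII);
    }
}
```

No lexer is involved: each procedure reads the data type its production predicts (a four-character tag or a ULONG), and the cksize attribute is passed down to bound the chunks nested inside the FORM.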
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
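The conversion from character tokens to binary data types can be sketched in plain Java, independent of ANTLR. The fragment below is an illustration with hypothetical names. It assumes the file was decoded as ISO-8859-1, so that every byte becomes exactly one character in the range 0 to 255, and it assumes high-byte-first order, as used by IFF.

```java
/** Sketch: turning one-byte character tokens into binary data types (hypothetical names). */
final class ByteTokenSketch {
    /** Value of a single byte token, 0..255. */
    static int uByte(char c) {
        return c & 0x00FF;
    }

    /** Unsigned 16-bit value from two byte tokens, high byte first. */
    static int uWord(char b1, char b2) {
        return (uByte(b1) << 8) | uByte(b2);
    }

    /** Signed 16-bit value from two byte tokens, high byte first. */
    static short word(char b1, char b2) {
        return (short) uWord(b1, b2);
    }

    /** Unsigned 32-bit value from four byte tokens, high byte first. */
    static long uLong(char h1, char h2, char l1, char l2) {
        // The first operand is widened to long before shifting so bit 31 is not
        // treated as a sign bit.
        return ((long) uByte(h1) << 24) | (uByte(h2) << 16) | (uByte(l1) << 8) | uByte(l2);
    }
}
```

For example, the four bytes 0x00 0x00 0x09 0xFC decoded as ISO-8859-1 and passed to `uLong` give 2556. A single-byte encoding matters here: with a multi-byte encoding such as UTF-8, byte values above 127 would not map to one character each.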
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;
options {
    language = Java;  // The programming language of the output parser
    TokenLabelType = CommonToken;
}
// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a Byte terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}
// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      {
          // Print out ILBM chunk size
          System.out.println("ID ILBM Size " + $ckSize.value);
      }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks
      body
      { $result = true; }
    ;
// The bmhd, cmap and camg property chunks can occur in any order
// At least one BMHD data chunk must be present
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg )+
    ;
bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;
bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      {
          // Print out values in the bitmap structure
          System.out.println(" Width " + $width.value);
          System.out.println(" Height " + $height.value);
          System.out.println(" X Position " + $xPosition.value);
          System.out.println(" Y Position " + $yPosition.value);
          System.out.println(" Number of Panes " + $nPanes.value);
          System.out.println(" Masking " + $masking.value);
          System.out.println(" Compression " + $compression.value);
          System.out.println(" Reserved1 " + $reserved1.value);
          System.out.println(" TransParent Color " +
              $transparentColor.value);
          System.out.println(" X Aspest " + $xAspect.value);
          System.out.println(" Y Aspect " + $yAspect.value);
          System.out.println(" Page Width " + $pageWidth.value);
          System.out.println(" Page Height " + $pageHeight.value);
      }
    ;
// Initialize the index value i in the cmap function
cmap returns [long value]
@init { int i = 0; }
    : CMAP ckSize
      {
          // Print out chunk size
          $value = $ckSize.value;
          System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        {
            // Print out color values
            System.out.println(" Color[" + i++ + "] " +
                $colorRegister.redValue +
                " " + $colorRegister.greenValue +
                " " + $colorRegister.blueValue +
                "");
        }
      )+
    ;
colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
          $redValue = $red.value;
          $greenValue = $green.value;
          $blueValue = $blue.value;
      }
    ;
camg
    : CAMG ckSize longValue
      {
          System.out.println("ID CAMG Size " + $ckSize.value);
          System.out.println(" View Modes 0x" +
              Long.toHexString($longValue.value));
      }
    ;
// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;
crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;
cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;
ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;
cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;
body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;
// defines the data chunk of the body chunk
data returns [long value]
    : (b+=BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;
// The cksize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;
// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
          $value = (long)
              ( getUByteValue($h1) << 24 |
                getUByteValue($h2) << 16 |
                getUByteValue($l1) << 8 |
                getUByteValue($l2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
          $value = (int)
              ( getUByteValue($b1) << 8 |
                getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
          $value = (short)
              ( getUByteValue($b1) << 8 |
                getUByteValue($b2));
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      {
          // calls the getUByteValue from a BYTE token
          $value = (int) getUByteValue($BYTE);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the grammar file described in the previous section, ANTLR generates the
iblmLexer and iblmParser Java programs. The Java application shown in Figure 7 uses these
programs to validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;
import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;
public class Test {
    static String filePath = "C:/Users/sl170/Pictures/IFF-ilbm/BLK.IFF";
    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps every input byte to exactly one character token
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}
// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC {$ckId.value.equals("RIFF")}?
      ckSize = dWordValue
      formType = fourCC {$formType.value.equals("WAVE")}?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC {$subChunkId.value.equals("fmt ")}?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue         // Format category
      channels = wordValue          // Number of channels
      samplesPerSec = dWordValue    // Sampling rate
      avgBytesPerSec = dWordValue   // For buffer estimation
      blockAlign = wordValue        // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue     // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC {$subChunkId.value.equals("data")}?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ({size >= 0}?=> BYTE {size--;})+
    ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        else
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        else
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the Lexer
BYTE
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
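The checks made by the start rule of this grammar (a "RIFF" identifier, a little-endian size, a "WAVE" form type and a "fmt " subchunk) can also be sketched in plain Java, which may help when reading the grammar. This is only an illustration under stated assumptions: the class WaveHeaderCheck and its methods are hypothetical names, and the sketch inspects only the first 16 bytes rather than validating the whole file as the generated parser does.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone check of the fields the grammar's start rule tests.
final class WaveHeaderCheck {
    // Read a four-character code (fourCC) at the buffer's current position.
    static String fourCC(ByteBuffer buf) {
        byte[] id = new byte[4];
        buf.get(id);
        return new String(id, StandardCharsets.ISO_8859_1);
    }

    // Read an unsigned 32-bit value into a long, using the buffer's byte order.
    static long dWord(ByteBuffer buf) {
        return buf.getInt() & 0xFFFFFFFFL;
    }

    // True when the bytes begin with "RIFF", a size, "WAVE" and a "fmt " subchunk id.
    static boolean looksLikeWave(byte[] bytes) {
        if (bytes.length < 16) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        if (!fourCC(buf).equals("RIFF")) {
            return false;
        }
        dWord(buf); // ckSize is read but not checked in this sketch
        if (!fourCC(buf).equals("WAVE")) {
            return false;
        }
        return fourCC(buf).equals("fmt ");
    }
}
```

Masking the int with 0xFFFFFFFFL keeps sizes above 2^31 - 1 from turning negative, a case the shift-and-or expression in dWordValue does not guard against.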
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for these classes
of grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers and players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars – An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A. Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$
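Because the octave sizes form a geometric series, the total collapses to (2^ctOctave - 1) × (oneShotHiSamples + repeatHiSamples), which a validator could compare against the BODY chunk size. The following Java sketch shows that arithmetic. The class name Svx8Body and the method bodySampleCount are hypothetical, and the bound of 62 octaves is an assumption made only to keep the shift within a Java long.

```java
// Hypothetical helper: expected number of samples in an 8SVX BODY chunk.
final class Svx8Body {
    // (2^0 + ... + 2^(ctOctave-1)) * (oneShotHiSamples + repeatHiSamples)
    //   = (2^ctOctave - 1) * (oneShotHiSamples + repeatHiSamples)
    static long bodySampleCount(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        if (ctOctave < 0 || ctOctave > 62) {
            // Assumed limit so that 1L << ctOctave cannot overflow
            throw new IllegalArgumentException("ctOctave out of range: " + ctOctave);
        }
        long highestOctaveSamples = oneShotHiSamples + repeatHiSamples;
        return ((1L << ctOctave) - 1) * highestOctaveSamples;
    }

    public static void main(String[] args) {
        // Three octaves of 128, 256 and 512 samples
        System.out.println(bodySampleCount(3, 100, 28)); // prints 896
    }
}
```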
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
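The header chunk production can be read as a fixed 14-byte layout: the "MThd" identifier, a 32-bit length of 6, and three 16-bit integers. The Java sketch below parses that layout. The class MidiHeader is a hypothetical name, and the sketch relies on Standard MIDI Files storing multi-byte integers in big-endian order, which is also the default byte order of java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical value class for the <headerchunk> production.
final class MidiHeader {
    final int format;
    final int ntrks;
    final int division;

    MidiHeader(int format, int ntrks, int division) {
        this.format = format;
        this.ntrks = ntrks;
        this.division = division;
    }

    // Parses "MThd" <length of header data> <format> <ntrks> <division>.
    static MidiHeader parse(byte[] bytes) {
        if (bytes.length < 14) {
            throw new IllegalArgumentException("header chunk needs 14 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes); // big-endian by default
        byte[] id = new byte[4];
        buf.get(id);
        if (!new String(id, StandardCharsets.US_ASCII).equals("MThd")) {
            throw new IllegalArgumentException("missing MThd identifier");
        }
        if (buf.getInt() != 6) {
            throw new IllegalArgumentException("length of header data must be 6");
        }
        int format = buf.getShort() & 0xFFFF;   // three unsigned 16-bit integers
        int ntrks = buf.getShort() & 0xFFFF;
        int division = buf.getShort() & 0xFFFF;
        return new MidiHeader(format, ntrks, division);
    }
}
```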
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
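The layout in the table above can be checked mechanically. The following Java sketch is one way to do so; it is not part of the original report. The class name WebPHeaderCheck and the method looksLikeWebP are hypothetical, and the sketch assumes the simple 2010 layout with a single "VP8 " chunk and little-endian RIFF size fields.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical checker for the six-field WebP layout listed in the table. */
final class WebPHeaderCheck {

    /** Reads a 4-byte tag such as "RIFF" at the buffer's current position. */
    private static String tag(ByteBuffer in) {
        byte[] raw = new byte[4];
        in.get(raw);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Returns true if the bytes follow the RIFF / WEBP / "VP8 " header layout. */
    static boolean looksLikeWebP(byte[] file) {
        if (file.length < 20) {
            return false;                                // too short to hold the header
        }
        // RIFF stores its size fields in little-endian byte order.
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (!tag(in).equals("RIFF")) return false;       // offset 0
        long riffSize = in.getInt() & 0xFFFFFFFFL;       // offset 4
        if (!tag(in).equals("WEBP")) return false;       // offset 8
        if (!tag(in).equals("VP8 ")) return false;       // offset 12
        long vp8Size = in.getInt() & 0xFFFFFFFFL;        // offset 16
        // Context-sensitive checks: both sizes must fit in the file, and the
        // VP8 chunk (8-byte header plus data) plus the 4-byte form type must
        // fit inside the RIFF data.
        return riffSize <= file.length - 8
            && vp8Size <= file.length - 20
            && vp8Size + 12 <= riffSize;
    }
}
```

The size comparisons play the role that semantic rules play in an attribute grammar: the tags alone are context-free, while the size fields constrain how much data may follow.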
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last.
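The segment format above is enough to walk the marker segments of a JPEG/JFIF file up to the start of the compressed image data. The Java sketch below illustrates this; the class JpegSegments and its method markerTypes are hypothetical names introduced here. It stops at the SOS marker ($da) or the EOI marker ($d9), because the entropy-coded data that follows SOS is not organized as length-prefixed segments.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical walker over the length-prefixed segments of a JPEG/JFIF file. */
final class JpegSegments {

    /** Returns the segment type bytes found after SOI, up to and including SOS or EOI. */
    static List<Integer> markerTypes(byte[] file) {
        if (file.length < 2 || (file[0] & 0xFF) != 0xFF || (file[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header ($ff $d8)");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 2 <= file.length) {
            if ((file[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = file[pos + 1] & 0xFF;
            types.add(type);
            if (type == 0xDA || type == 0xD9) {
                break;                        // SOS or EOI: stop walking segments
            }
            if (pos + 4 > file.length) {
                throw new IllegalArgumentException("truncated segment header");
            }
            // Size is big-endian ("not Intel order") and counts its own two bytes.
            int size = ((file[pos + 2] & 0xFF) << 8) | (file[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("segment size below 2");
            }
            pos += 2 + size;                  // skip $ff, type byte, and the segment
        }
        return types;
    }
}
```

A full validator would also need rules for the scan data and for markers that carry no size field (for example the restart markers); those are deliberately left out of this sketch.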
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 InteractiveFiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
<ltxt-ck> -> 'ltxt' cksize <dwName:DWORD>
                          <dwSampleLength:DWORD>
                          <dwPurpose:DWORD>
                          <wCountry:WORD>
                          <wLanguage:WORD>
                          <wDialect:WORD>
                          <wCodePage:WORD>
                          <data:BYTE>+
<file-ck> -> 'file' cksize <dwName:DWORD> <dwMedType:DWORD>
                          <fileData:BYTE>+
Figure 5. Binary File Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive (or non-recursive)
procedures, where each procedure implements one of the production rules of the grammar. A
predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free grammars
for which there exists some positive integer k that allows a recursive descent parser to decide
which production to use by examining only the next k tokens of input. A recursive descent parser
for the binary file format grammars defined in this paper does not require a separate lexer to
identify the data of a file. The data types are predicted by the grammar rules.
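As a minimal sketch of this idea (not code from the original project), the following Java class implements one procedure per production for a simplified EA IFF 85 grammar, form -> "FORM" ckSize formType chunk*. No lexer is involved: each procedure reads the data type that the grammar predicts at that position. The class name IffRecognizer and its helper methods are hypothetical, and chunk contents are skipped rather than parsed.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical recursive descent recognizer: one procedure per production. */
final class IffRecognizer {
    // ByteBuffer is big-endian by default, which matches EA IFF 85.
    private final ByteBuffer in;

    IffRecognizer(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** form -> "FORM" ckSize formType chunk* ; returns false on any syntax error. */
    boolean form() {
        try {
            expect("FORM");
            long ckSize = uLong();
            if (ckSize < 4 || ckSize > in.remaining()) {
                return false;                       // size attribute contradicts the file
            }
            int end = in.position() + (int) ckSize;
            fourCC();                               // formType, for example "ILBM"
            while (in.position() < end) {
                chunk(end);
            }
            return true;
        } catch (BufferUnderflowException | IllegalStateException e) {
            return false;
        }
    }

    /** chunk -> ckId ckSize BYTE[ckSize] [pad] ; the data bytes are skipped. */
    private void chunk(int end) {
        fourCC();
        long ckSize = uLong();
        long padded = ckSize + (ckSize & 1);        // IFF chunks are padded to even length
        if (padded > end - in.position()) {
            throw new IllegalStateException("chunk overruns its FORM");
        }
        in.position(in.position() + (int) padded);
    }

    /** The grammar predicts a 4-character identifier here. */
    private String fourCC() {
        byte[] raw = new byte[4];
        in.get(raw);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** The grammar predicts an unsigned 32-bit integer here. */
    private long uLong() {
        return in.getInt() & 0xFFFFFFFFL;
    }

    private void expect(String id) {
        if (!fourCC().equals(id)) {
            throw new IllegalStateException("expected " + id);
        }
    }
}
```

Calling new IffRecognizer(bytes).form() returns true for a well-formed FORM and false otherwise. Because each procedure knows which data type comes next, one fixed decision per rule suffices and no backtracking is needed.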
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check context-sensitive aspects of the grammar, such as chunk size or image dimensions, and to
use these values in parsing the chunk data or images. The pointers to file addresses that occur in
the file formats are handled by pushing the current address in the file onto a pushdown stack.
When the parser exhausts the procedures implementing production rules corresponding to the
pointed-to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
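The address stack can be sketched in Java as follows. The record layout used here (a 16-bit count followed by a 32-bit absolute offset to that many 16-bit values) is an invented example, not one of the formats in this report, and the names PointerFollowingParser, record, and values are hypothetical. The count is an attribute computed in one part of the rule and used to parse another part, which is the kind of dependency an L-attributed grammar allows.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical parser that follows a file pointer using a pushdown stack of addresses. */
final class PointerFollowingParser {
    private final ByteBuffer in;                                       // big-endian by default
    private final Deque<Integer> returnAddresses = new ArrayDeque<>();

    PointerFollowingParser(byte[] file) {
        this.in = ByteBuffer.wrap(file);
    }

    /** record -> count pointer ; the pointer addresses `count` 16-bit values. */
    int[] record() {
        int count = in.getShort() & 0xFFFF;      // attribute used when parsing the target
        int target = in.getInt();                // absolute file offset
        returnAddresses.push(in.position());     // remember where parsing must resume
        in.position(target);                     // throws IllegalArgumentException if out of range
        int[] result = values(count);            // parse the pointed-to structure
        in.position(returnAddresses.pop());      // pop the address and resume there
        return result;
    }

    /** values -> WORD[count] */
    private int[] values(int count) {
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            result[i] = in.getShort() & 0xFFFF;
        }
        return result;
    }

    /** Current file address, exposed so a caller can observe where parsing resumed. */
    int position() {
        return in.position();
    }
}
```

An explicit stack, rather than a single saved variable, matters when the pointed-to structure itself contains pointers: each nested jump pushes another return address.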
Recursive descent parsers with a pushdown stack can be used with binary file format grammars
to create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads
a file and generates an error if the file does not conform to the syntax specified by the grammar.
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This approach is effective, if somewhat clumsy, and has allowed us
to test our attribute grammars for binary file formats as well as to demonstrate the feasibility of
creating a parser generator for binary file grammars.
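The conversion functions can be sketched in plain Java, independent of ANTLR. In the sketch below the class ByteTokens and its method names are hypothetical. It relies on the fact that decoding a file as ISO-8859-1 maps every byte to exactly one character in the range 0 to 255, so each character can stand in for one byte token.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helpers that turn one-byte character tokens into binary data types. */
final class ByteTokens {

    /** Decodes raw bytes so that each byte becomes exactly one char (0..255). */
    static String tokens(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Unsigned 8-bit value of one character token. */
    static int uByte(char c) {
        return c & 0x00FF;
    }

    /** Unsigned 16-bit value, high byte first (as in IFF). */
    static int uWordBE(char hi, char lo) {
        return (uByte(hi) << 8) | uByte(lo);
    }

    /** Signed 16-bit value, high byte first; the short cast sign-extends. */
    static int wordBE(char hi, char lo) {
        return (short) uWordBE(hi, lo);
    }

    /** Unsigned 32-bit value, high byte first; widened to long before shifting. */
    static long uLongBE(char b1, char b2, char b3, char b4) {
        return ((long) uByte(b1) << 24) | (uByte(b2) << 16) | (uByte(b3) << 8) | uByte(b4);
    }

    /** Unsigned 32-bit value, low byte first (as in RIFF). */
    static long dWordLE(char b1, char b2, char b3, char b4) {
        return uLongBE(b4, b3, b2, b1);
    }
}
```

For example, the four bytes 00 00 09 FC read high byte first give 2556. Widening the top byte to long before shifting keeps values of 0x80000000 and above from turning negative.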
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}
// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      {
        // Print out ILBM chunk size
        System.out.println("ID ILBM Size " + $ckSize.value);
      }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks*
      body
      { $result = true; }
    ;

// The bmhd, cmap, and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg
      )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      {
        // Print out values in the bitmap structure
        System.out.println(" Width " + $width.value);
        System.out.println(" Height " + $height.value);
        System.out.println(" X Position " + $xPosition.value);
        System.out.println(" Y Position " + $yPosition.value);
        System.out.println(" Number of Panes " + $nPanes.value);
        System.out.println(" Masking " + $masking.value);
        System.out.println(" Compression " + $compression.value);
        System.out.println(" Reserved1 " + $reserved1.value);
        System.out.println(" TransParent Color " +
            $transparentColor.value);
        System.out.println(" X Aspest " + $xAspect.value);
        System.out.println(" Y Aspect " + $yAspect.value);
        System.out.println(" Page Width " + $pageWidth.value);
        System.out.println(" Page Height " + $pageHeight.value);
      }
    ;
// The index value i is initialized in the @init section of the cmap rule
cmap returns [long value]
@init { int i = 0; }
    : CMAP ckSize
      {
        // Print out chunk size
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        {
          // Print out color values
          System.out.println(" Color[" + i++ + "] " +
              $colorRegister.redValue + " " +
              $colorRegister.greenValue + " " +
              $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println(" View Modes 0x" +
            Long.toHexString($longValue.value));
      }
    ;
// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// Defines the data portion of the body chunk
data returns [long value]
    : ( b += BYTE )+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;
// The ckSize is the value of a long integer
ckSize returns [long value]
    : longValue { $value = $longValue.value; }
    ;

// Defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        $value = (long)
            ( getUByteValue($h1) << 24 |
              getUByteValue($h2) << 16 |
              getUByteValue($l1) << 8  |
              getUByteValue($l2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      {
        // Calls getUByteValue on the BYTE token
        $value = (int) getUByteValue($BYTE);
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : '\u0000'..'\u00FF' ;
Figure 6. An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the grammar described in the previous section, ANTLR generates the iblmLexer
and iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps each input byte to exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;                // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}
// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue         // Format category
      channels = wordValue          // Number of channels
      samplesPerSec = dWordValue    // Sampling rate
      avgBytesPerSec = dWordValue   // For buffer estimation
      blockAlign = wordValue        // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue     // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

// Consumes BYTE tokens while the remaining count is positive,
// so that exactly size bytes are read
waveData [long size]
    : ( { size > 0 }?=> BYTE { size--; } )+
    ;
// Defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        else
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        else
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer
BYTE : '\u0000'..'\u00FF' ;
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
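A signature test compares bytes at fixed offsets against known values. The Java sketch below illustrates the idea for a few of the formats named in this report; it is only an analogue of the magic-file approach, not the identifier used in this research. The class MagicTests, its method names, and the returned labels are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical signature ("magic") tests for a few chunk-based formats. */
final class MagicTests {
    // The 8-byte PNG signature.
    private static final byte[] PNG = { (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n' };
    // The JPEG SOI marker.
    private static final byte[] JPEG = { (byte) 0xFF, (byte) 0xD8 };

    /** True if `magic` occurs in `head` starting at `offset`. */
    private static boolean at(byte[] head, int offset, byte[] magic) {
        if (head.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean at(byte[] head, int offset, String magic) {
        return at(head, offset, magic.getBytes(StandardCharsets.ISO_8859_1));
    }

    /** Returns a label for the first matching signature, or "unknown". */
    static String identify(byte[] head) {
        // IFF and RIFF files carry a container tag at offset 0 and a form type at offset 8.
        if (at(head, 0, "FORM") && at(head, 8, "ILBM")) return "IFF ILBM";
        if (at(head, 0, "FORM") && at(head, 8, "AIFF")) return "IFF AIFF";
        if (at(head, 0, "RIFF") && at(head, 8, "WAVE")) return "RIFF WAVE";
        if (at(head, 0, "RIFF") && at(head, 8, "WEBP")) return "RIFF WebP";
        if (at(head, 0, PNG)) return "PNG";
        if (at(head, 0, JPEG)) return "JPEG";
        return "unknown";
    }
}
```

Identification of this kind only selects which grammar to apply; it says nothing about whether the rest of the file conforms, which is the job of the recognizer.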
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewer/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A. V. Aho. Indexed Grammars - An Extension of Context-free Grammars. Journal
of the ACM, Vol. 15, Issue 4, Oct. 1968.
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=ani&button=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Products&product=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTK&button=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=okt&button=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^0 + \cdots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$$
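The sample count above is a geometric sum, so a validator can check the BODY chunk size against the header fields directly. The following is a minimal Java sketch of that check; the class and method names are hypothetical and are not part of the 8SVX specification.

```java
// Illustrative sketch: expected number of samples in an 8SVX BODY chunk.
// Class and method names are assumptions made for this example.
class Svx8BodySize {

    /**
     * (2^0 + ... + 2^(ctOctave-1)) * (oneShotHiSamples + repeatHiSamples).
     * The geometric sum 2^0 + ... + 2^(n-1) equals 2^n - 1.
     */
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        if (ctOctave < 1 || ctOctave > 62) {
            throw new IllegalArgumentException("ctOctave out of range: " + ctOctave);
        }
        long samplesInHighestOctave = oneShotHiSamples + repeatHiSamples;
        long octaveFactor = (1L << ctOctave) - 1;
        return octaveFactor * samplesInHighestOctave;
    }

    public static void main(String[] args) {
        // Three octaves with 100 one-shot and 28 repeat samples in the highest octave:
        // (1 + 2 + 4) * 128 samples.
        System.out.println(bodySamples(3, 100, 28));
    }
}
```

For uncompressed data (one byte per sample) this value can be compared with the cksize of the BODY chunk as an attribute constraint.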
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
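The header chunk production can be checked with a few lines of code. The sketch below, in Java, recognizes the header chunk and enforces the constraint that the length of header data is 6. It assumes big-endian integers, as used by Standard MIDI Files; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: recognize <headerchunk> of a Standard MIDI File.
// Names are assumptions for this example, not part of any MIDI library.
class MidiHeaderSketch {

    /** Returns {format, ntrks, division} read from the header chunk. */
    static int[] parseHeader(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("file too short for a header chunk");
        }
        String id = new String(file, 0, 4, StandardCharsets.US_ASCII);
        if (!id.equals("MThd")) {
            throw new IllegalArgumentException("expected MThd, found " + id);
        }
        ByteBuffer buf = ByteBuffer.wrap(file); // ByteBuffer is big-endian by default
        int length = buf.getInt(4);
        if (length != 6) {
            throw new IllegalArgumentException("length of header data must be 6, found " + length);
        }
        int format = buf.getShort(8) & 0xFFFF;
        int ntrks = buf.getShort(10) & 0xFFFF;
        int division = buf.getShort(12) & 0xFFFF;
        return new int[] {format, ntrks, division};
    }

    /** A hand-built header: format 1, two tracks, division 96. */
    static byte[] sample() {
        return new byte[] {'M', 'T', 'h', 'd', 0, 0, 0, 6, 0, 1, 0, 2, 0, 96};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseHeader(sample())));
    }
}
```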
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
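The fixed offsets in this layout make the container easy to check by hand. The following Java sketch reads the two tags, the form type, and the two size fields at the offsets listed above. It assumes the sizes are 32-bit little-endian integers, as in other RIFF files; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: validate the RIFF container of a simple WebP file.
// Names are assumptions for this example.
class WebPHeaderSketch {

    static String tagAt(byte[] file, int offset) {
        return new String(file, offset, 4, StandardCharsets.US_ASCII);
    }

    /** Checks the tags at offsets 0, 8, and 12 and returns the VP8 data size at offset 16. */
    static long vp8DataSize(byte[] file) {
        if (file.length < 20) {
            throw new IllegalArgumentException("file too short for a WebP header");
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (!tagAt(file, 0).equals("RIFF")) throw new IllegalArgumentException("missing RIFF tag");
        long riffSize = buf.getInt(4) & 0xFFFFFFFFL;   // size of data starting at offset 8
        if (!tagAt(file, 8).equals("WEBP")) throw new IllegalArgumentException("missing WEBP form type");
        if (!tagAt(file, 12).equals("VP8 ")) throw new IllegalArgumentException("missing VP8 tag");
        long vp8Size = buf.getInt(16) & 0xFFFFFFFFL;   // size of data starting at offset 20
        if (8 + riffSize > file.length || 20 + vp8Size > file.length) {
            throw new IllegalArgumentException("declared sizes exceed file length");
        }
        return vp8Size;
    }

    /** A hand-built container holding two placeholder bytes of image data. */
    static byte[] sample() {
        ByteBuffer b = ByteBuffer.allocate(22).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII)).putInt(14);
        b.put("WEBP".getBytes(StandardCharsets.US_ASCII));
        b.put("VP8 ".getBytes(StandardCharsets.US_ASCII)).putInt(2);
        b.put((byte) 1).put((byte) 2);
        return b.array();
    }

    public static void main(String[] args) {
        System.out.println(vp8DataSize(sample()));
    }
}
```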
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff      identifies segment
    n        type of segment (one byte)
    sh, sl   size of the segment, including these two bytes but not
             including the $ff and the type byte. Note: not Intel order;
             high byte first, low byte last
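The segment layout above can be walked with a short loop. The Java sketch below checks the SOI header, then steps from segment to segment using the two-byte big-endian size, and stops at EOI. As a simplifying assumption it also stops at a start-of-scan segment (type $da), because the entropy-coded image data that follows is not laid out as segments. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: list the segment types of a JPEG/JFIF byte stream.
// Names are assumptions for this example.
class JpegSegmentSketch {

    static List<Integer> segmentTypes(byte[] f) {
        if (f.length < 4 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 1 < f.length) {
            if ((f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                return types;                      // EOI trailer reached
            }
            if (pos + 3 >= f.length) {
                throw new IllegalArgumentException("truncated segment header");
            }
            // sh, sl: high byte first; the size counts these two bytes as well
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("invalid segment size " + size);
            }
            types.add(type);
            if (type == 0xDA) {
                return types;                      // assumption: stop at start of scan
            }
            pos += 2 + size;                       // skip $ff, type, and the sized part
        }
        throw new IllegalArgumentException("missing EOI trailer");
    }

    /** SOI, one APP0-type segment, one $db segment, EOI (payloads are placeholders). */
    static byte[] sample() {
        return new byte[] {
            (byte) 0xFF, (byte) 0xD8,
            (byte) 0xFF, (byte) 0xE0, 0x00, 0x04, 0x4A, 0x46,
            (byte) 0xFF, (byte) 0xDB, 0x00, 0x03, 0x00,
            (byte) 0xFF, (byte) 0xD9
        };
    }

    public static void main(String[] args) {
        System.out.println(segmentTypes(sample()));
    }
}
```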
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entry -> red green blue
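As a concrete illustration of the signature and header chunk, the Java sketch below checks the 8-byte PNG signature and reads the image width and height from the IHDR chunk. It relies on the chunk layout defined in the PNG specification (a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC) and does not verify the CRC. Class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: recognize PNGSignature and the IHDR header chunk.
// Names are assumptions for this example; the CRC is not checked.
class PngHeaderSketch {

    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Returns {width, height} from the IHDR chunk that follows the signature. */
    static int[] widthHeight(byte[] file) {
        if (file.length < 8 + 4 + 4 + 13 + 4) {
            throw new IllegalArgumentException("file too short for signature and IHDR");
        }
        if (!Arrays.equals(Arrays.copyOfRange(file, 0, 8), SIGNATURE)) {
            throw new IllegalArgumentException("missing PNG signature");
        }
        ByteBuffer buf = ByteBuffer.wrap(file);       // big-endian by default
        int length = buf.getInt(8);
        String type = new String(file, 12, 4, StandardCharsets.US_ASCII);
        if (length != 13 || !type.equals("IHDR")) {
            throw new IllegalArgumentException("first chunk is not a 13-byte IHDR");
        }
        return new int[] {buf.getInt(16), buf.getInt(20)};
    }

    /** Signature plus an IHDR chunk for a 200 x 144 image; the CRC bytes are left as zeros. */
    static byte[] sample() {
        ByteBuffer b = ByteBuffer.allocate(33);
        b.put(SIGNATURE).putInt(13).put("IHDR".getBytes(StandardCharsets.US_ASCII));
        b.putInt(200).putInt(144);
        b.put(new byte[] {8, 3, 0, 0, 0});            // bit depth, colour type, three method bytes
        b.putInt(0);                                  // placeholder CRC
        return b.array();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(widthHeight(sample())));
    }
}
```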
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
6 A Parser Generator for Recognizing Binary File Formats
What we want is a parser generator whose input is a binary file attribute grammar for a particular
binary file format and whose generated output is the Java source code of a recursive descent
parser (with a pushdown stack, if needed) for the class of binary file formats specified by the
grammar. Such a parser generator does not yet exist, but there is a widely used parser generator
for attribute grammars for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a recognizer for that language. ANTLR
supports generating code in a number of programming languages, including C, Java,
JavaScript, and Python.
ANTLR also generates a lexer from lexical rules in the grammar. A lexer is a program that
converts a sequence of characters into tokens. A lexer is not needed to parse binary file formats,
because the data types predicted by the parser are the tokens of the grammar. Furthermore, the
capability to recognize LL(*) grammars is not needed, because the binary file format grammars
specified so far do not require lookahead of k symbols.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary file format grammars in recognizing the
chunk-based and directory-based binary file formats. This is accomplished by writing a lexer that
treats each input byte in a file as a character token. Then functions are created for each data type
in the binary file grammar that convert the appropriate number of character tokens to binary data
types, e.g., int16, int32, etc. This is effective, if somewhat clumsy, and has allowed us to test our
attribute grammars for binary file formats, as well as to demonstrate the feasibility of creating a
parser generator for binary file grammars.
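The conversion functions described above amount to masking each character token to its low 8 bits and shifting the bytes into place. The following Java sketch shows the idea outside of ANTLR, for both big-endian (IFF) and little-endian (RIFF) byte order. The class and method names are hypothetical and do not come from the generated parsers.

```java
// Illustrative sketch: composing binary data types from byte-valued characters,
// as a lexer that reads the file as ISO-8859-1 text would deliver them.
// Names are assumptions for this example.
class ByteTokenConversions {

    /** Unsigned byte value (0..255) of a one-byte character token. */
    static int uByte(char c) {
        return c & 0x00FF;
    }

    /** Unsigned 16-bit word, high byte first (IFF order). */
    static int uWordBE(char b1, char b2) {
        return (uByte(b1) << 8) | uByte(b2);
    }

    /** Signed 16-bit word, high byte first; the cast to short sign-extends. */
    static int wordBE(char b1, char b2) {
        return (short) uWordBE(b1, b2);
    }

    /** Unsigned 32-bit value, high byte first (IFF order). */
    static long longBE(char b1, char b2, char b3, char b4) {
        return ((long) uByte(b1) << 24) | (uByte(b2) << 16) | (uByte(b3) << 8) | uByte(b4);
    }

    /** Unsigned 32-bit value, low byte first (RIFF order). */
    static long dWordLE(char b1, char b2, char b3, char b4) {
        return longBE(b4, b3, b2, b1);
    }

    public static void main(String[] args) {
        System.out.println(uWordBE((char) 0x00, (char) 0xC8));
        System.out.println(wordBE((char) 0xFF, (char) 0xFE));
        System.out.println(longBE((char) 0x00, (char) 0x00, (char) 0x09, (char) 0x90));
        System.out.println(dWordLE((char) 0x90, (char) 0x09, (char) 0x00, (char) 0x00));
    }
}
```

Widening the most significant byte to long before shifting keeps values of 2^31 and above from turning negative, which matters when the result is used as a chunk size.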
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 6 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a lexer and parser for ILBM files.
grammar iblm;

options {
  language = Java;  // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a BYTE terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }
}

// Grammar rules used to create the parser
// Start symbol: iblm
iblm returns [boolean result]
  : FORM ckSize ILBM
    // Print out ILBM chunk size
    { System.out.println("ID ILBM Size " + $ckSize.value); }
    // Semantic predicate that ensures that at least one BMHD is found
    propertyChunks { $propertyChunks.foundBMHD == true }?
    dataChunks
    body
    { $result = true; }
  ;

// The bmhd, cmap, and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
  : ( bmhd { $foundBMHD = true; }
    | cmap
    | camg
    )+
  ;

bmhd
  : BMHD ckSize
    { System.out.println("ID BMHD Size " + $ckSize.value); }
    bitMapHeader
  ;

bitMapHeader
  : width = uWordValue
    height = uWordValue
    xPosition = wordValue
    yPosition = wordValue
    nPanes = uByteValue
    masking = uByteValue
    compression = uByteValue
    reserved1 = uByteValue
    transparentColor = uWordValue
    xAspect = uByteValue
    yAspect = uByteValue
    pageWidth = wordValue
    pageHeight = wordValue
    {
      // Print out values in the bitmap structure
      System.out.println(" Width " + $width.value);
      System.out.println(" Height " + $height.value);
      System.out.println(" X Position " + $xPosition.value);
      System.out.println(" Y Position " + $yPosition.value);
      System.out.println(" Number of Panes " + $nPanes.value);
      System.out.println(" Masking" + $masking.value);
      System.out.println(" Compression" + $compression.value);
      System.out.println(" Reserved1" + $reserved1.value);
      System.out.println(" TransParent Color" +
                         $transparentColor.value);
      System.out.println(" X Aspest" + $xAspect.value);
      System.out.println(" Y Aspect" + $yAspect.value);
      System.out.println(" Page Width" + $pageWidth.value);
      System.out.println(" Page Height" + $pageHeight.value);
    }
  ;

cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
  : CMAP ckSize
    {
      // Print out chunk size
      $value = $ckSize.value;
      System.out.println("ID CMAP Size " + $ckSize.value);
    }
    ( colorRegister
      {
        // Print out color values
        System.out.println(" Color[" + i++ + "] " +
                           $colorRegister.redValue +
                           " " + $colorRegister.greenValue +
                           " " + $colorRegister.blueValue);
      }
    )+
  ;

colorRegister returns [int redValue, int greenValue, int blueValue]
  : red = uByteValue green = uByteValue blue = uByteValue
    {
      $redValue = $red.value;
      $greenValue = $green.value;
      $blueValue = $blue.value;
    }
  ;

camg
  : CAMG ckSize longValue
    {
      System.out.println("ID CAMG Size " + $ckSize.value);
      System.out.println(" View Modes 0x" +
                         Long.toHexString($longValue.value));
    }
  ;

// The crng and ccrt data chunks (zero or more, in any order)
dataChunks
  : ( crng
    | ccrt
    )*
  ;

crng
  : CRNG ckSize
    { System.out.println("ID CRNG Size " + $ckSize.value); }
    cRange
  ;

cRange
  : pad1 = wordValue { $pad1.value == 0 }?
    rate = wordValue
    active = wordValue
    low = uByteValue
    high = uByteValue
  ;

ccrt
  : CCRT ckSize
    { System.out.println("ID CCRT Size " + $ckSize.value); }
    cycleInfo
  ;

cycleInfo
  : direction = wordValue
    start = uByteValue
    end = uByteValue
    seconds = longValue
    microseconds = longValue
    pad = wordValue { $pad.value == 0 }?
  ;

body
  : BODY ckSize
    { System.out.println("ID BODY Size " + $ckSize.value); }
    { $ckSize.value > 0 }? data
    { System.out.println(" Number of bytes " + $data.value); }
    { $data.value == $ckSize.value }?
  ;

// defines the data portion of the body chunk
data returns [long value]
  : (b += BYTE)+ { $b.size() > 0 }?
    { $value = $b.size(); }
  ;

// The ckSize is the value of a long integer
ckSize returns [long value]
  : longValue
    { $value = $longValue.value; }
  ;

// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
  : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
    {
      $value = (long)
        ( getUByteValue($h1) << 24 |
          getUByteValue($h2) << 16 |
          getUByteValue($l1) << 8 |
          getUByteValue($l2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the unsigned word data type
uWordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (int)
        ( getUByteValue($b1) << 8 |
          getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (short)
        ( getUByteValue($b1) << 8 |
          getUByteValue($b2) );
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// defines the unsigned byte data type
uByteValue returns [int value]
  : BYTE
    // calls getUByteValue on the BYTE token
    { $value = (int) getUByteValue($BYTE); }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }

// Terminal rules used to create the lexer
FORM : 'FORM';
ILBM : 'ILBM';
BMHD : 'BMHD';
CMAP : 'CMAP';
CAMG : 'CAMG';
CRNG : 'CRNG';
CCRT : 'CCRT';
BODY : 'BODY';
// any single input byte, read as one ISO-8859-1 character
BYTE : '\u0000'..'\u00FF';
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;

import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:/Users/sl170/Pictures/IFF-ilbm/BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that each byte becomes one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
FILE C:/Users/sl170/Pictures/IFF-ilbm/BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;               // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file.
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file.
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class.
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value.
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}

// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    :   ckId = fourCC { $ckId.value.equals("RIFF") }?
        ckSize = dWordValue
        formType = fourCC { $formType.value.equals("WAVE") }?
        formatSubChunk
        dataSubChunk
        { $result = true; }
    ;

fourCC returns [String value]
    :   BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    :   subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
        ckSize = dWordValue
        commonFields
        pcmSpecificField
    ;

commonFields
    :   formatTag = wordValue         // Format category
        channels = wordValue          // Number of channels
        samplesPerSec = dWordValue    // Sampling rate
        avgBytesPerSec = dWordValue   // For buffer estimation
        blockAlign = wordValue        // Data block size
    ;

pcmSpecificField
    :   bitsPerSample = wordValue     // Sample size
    ;

dataSubChunk
    :   subChunkId = fourCC { $subChunkId.value.equals("data") }?
        ckSize = dWordValue
        waveData[$ckSize.value]
    ;

waveData [long size]
    :   ( { size >= 0 }?=> BYTE { size--; } )+
    ;

// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    :   b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
        {
            if (byteOrder.equals("II")) {
                $value = (long)
                    ( getUByteValue($b1) |
                      getUByteValue($b2) << 8 |
                      getUByteValue($b3) << 16 |
                      getUByteValue($b4) << 24 );
            } else {
                $value = (long)
                    ( getUByteValue($b1) << 24 |
                      getUByteValue($b2) << 16 |
                      getUByteValue($b3) << 8 |
                      getUByteValue($b4) );
            }
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    :   b1 = BYTE b2 = BYTE
        {
            if (byteOrder.equals("II")) {
                $value = (int)
                    ( getUByteValue($b1) |
                      getUByteValue($b2) << 8 );
            } else {
                $value = (int)
                    ( getUByteValue($b1) << 8 |
                      getUByteValue($b2) );
            }
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the Lexer
BYTE : . ;
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
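The dWordValue and wordValue rules in Figure 9 assemble multi-byte integers from BYTE tokens according to a byte-order flag, where "II" selects little-endian order. The following Java sketch shows the same arithmetic outside ANTLR. The class ByteOrderWords and its method names are hypothetical and are not part of the report's grammar. One deliberate difference from Figure 9 is that each byte is widened to long before shifting, which is an assumption about the intended behavior for values whose most significant byte is 0x80 or greater.

```java
// Illustrative sketch only; ByteOrderWords and its methods are hypothetical
// names, not part of the report's grammar or of ANTLR.
final class ByteOrderWords {

    // Mask a signed Java byte to its unsigned value (0..255).
    static int unsignedByte(byte b) {
        return b & 0xFF;
    }

    // Assemble an unsigned 16-bit value from two bytes starting at offset.
    static int word(byte[] data, int offset, boolean littleEndian) {
        int b1 = unsignedByte(data[offset]);
        int b2 = unsignedByte(data[offset + 1]);
        return littleEndian ? (b1 | (b2 << 8)) : ((b1 << 8) | b2);
    }

    // Assemble an unsigned 32-bit value from four bytes starting at offset.
    // Each byte is widened to long before shifting so that a high byte of
    // 0x80 or more cannot produce a negative result.
    static long dWord(byte[] data, int offset, boolean littleEndian) {
        long b1 = unsignedByte(data[offset]);
        long b2 = unsignedByte(data[offset + 1]);
        long b3 = unsignedByte(data[offset + 2]);
        long b4 = unsignedByte(data[offset + 3]);
        return littleEndian
                ? (b1 | (b2 << 8) | (b3 << 16) | (b4 << 24))
                : ((b1 << 24) | (b2 << 16) | (b3 << 8) | b4);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x10, 0x20, 0x30, (byte) 0x80};
        System.out.println(Long.toHexString(dWord(bytes, 0, true)));  // prints 80302010
        System.out.println(Long.toHexString(dWord(bytes, 0, false))); // prints 10203080
    }
}
```

RIFF files store their sizes in little-endian order, so a RIFF reader built on this sketch would pass true for littleEndian throughout.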
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been constructed for the
chunk-based file formats AIFF, JPEG, and WAVE, as well as the directory-based format TIFF.
However, the research reported herein differs from that of the JHOVE project in that what is
sought is a technology for generating validators for binary file formats from grammars
specifying the binary file format.
8 Conclusion
The question addressed by this research is whether it is possible to extend the context-free
grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, along with
the JHOVE project's work in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
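To make concrete what a generated recursive descent recognizer with built-in binary data types might look like, the following Java sketch hand-codes one method per rule of the grammar in Figure 9. It is an illustration under stated assumptions, not the output of any existing generator: the class WavePcmRecognizer and its method names are hypothetical, java.nio.ByteBuffer stands in for the built-in WORD and DWORD types, and, like the grammar, it checks only the chunk identifiers and the overall structure.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-written recognizer; one method per grammar rule of Figure 9.
final class WavePcmRecognizer {
    private final ByteBuffer in;

    WavePcmRecognizer(byte[] file) {
        // RIFF stores multi-byte integers in little-endian order.
        in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Convenience entry point: true if the bytes match the grammar.
    static boolean recognizes(byte[] file) {
        try {
            new WavePcmRecognizer(file).waveformPcm();
            return true;
        } catch (RuntimeException e) { // buffer underflow or a failed check
            return false;
        }
    }

    // waveform_pcm : "RIFF" ckSize "WAVE" formatSubChunk dataSubChunk
    void waveformPcm() {
        expect("RIFF");
        dWord();
        expect("WAVE");
        formatSubChunk();
        dataSubChunk();
    }

    // formatSubChunk : "fmt " ckSize commonFields pcmSpecificField
    void formatSubChunk() {
        expect("fmt ");
        dWord(); // ckSize
        word();  // formatTag
        word();  // channels
        dWord(); // samplesPerSec
        dWord(); // avgBytesPerSec
        word();  // blockAlign
        word();  // bitsPerSample
    }

    // dataSubChunk : "data" ckSize waveData[ckSize]
    void dataSubChunk() {
        expect("data");
        long size = dWord();
        if (size > in.remaining()) {
            throw new IllegalStateException("data chunk truncated");
        }
        in.position(in.position() + (int) size); // skip the sample bytes
    }

    String fourCC() {
        byte[] id = new byte[4];
        in.get(id);
        return new String(id, StandardCharsets.ISO_8859_1);
    }

    void expect(String id) {
        String actual = fourCC();
        if (!actual.equals(id)) {
            throw new IllegalStateException("expected " + id + ", found " + actual);
        }
    }

    int word() { return in.getShort() & 0xFFFF; }

    long dWord() { return in.getInt() & 0xFFFFFFFFL; }

    public static void main(String[] args) throws java.io.IOException {
        byte[] file = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        System.out.println("Successfully parsed = " + recognizes(file));
    }
}
```

Because the data types are read directly from the byte buffer, no separate lexer is involved, which is the property the paragraph above asks of a lighter-weight generator.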
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewer/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Languages Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Languages (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
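The sample count just described is the kind of attribute a validator could compute from the VHDR fields and compare against the BODY chunk size. A minimal Java sketch of that arithmetic follows. The class name Voice8Body and its methods are hypothetical, and the sketch assumes uncompressed 8-bit samples, so that one sample occupies one byte.

```java
// Illustrative sketch only; Voice8Body and its methods are hypothetical names.
final class Voice8Body {

    // Samples in octave k, where k = 0 is the highest-frequency octave:
    // 2^k * (oneShotHiSamples + repeatHiSamples).
    static long samplesInOctave(int k, long oneShotHiSamples, long repeatHiSamples) {
        return (1L << k) * (oneShotHiSamples + repeatHiSamples);
    }

    // Total samples in BODY: the sum over octaves 0 .. ctOctave-1.
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        long total = 0;
        for (int k = 0; k < ctOctave; k++) {
            total += samplesInOctave(k, oneShotHiSamples, repeatHiSamples);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three octaves with 100 one-shot and 28 repeat samples in the highest octave:
        // (1 + 2 + 4) * 128 = 896.
        System.out.println(bodySamples(3, 100, 28)); // prints 896
    }
}
```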
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The <length of header data> is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
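As an illustration of the header chunk production, the following Java sketch reads the "MThd" identifier, checks that the header length is 6, and extracts the three 16-bit integers. The class MidiHeader is hypothetical. The sketch relies on the fact that Standard MIDI Files store integers most significant byte first, which matches the default byte order of java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; MidiHeader is a hypothetical value class for the
// three 16-bit fields of <header data>.
final class MidiHeader {
    final int format;
    final int ntrks;
    final int division;

    private MidiHeader(int format, int ntrks, int division) {
        this.format = format;
        this.ntrks = ntrks;
        this.division = division;
    }

    // <headerchunk> -> "MThd" <length of header data> <header data>
    static MidiHeader parse(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file); // big-endian by default
        byte[] id = new byte[4];
        in.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MThd chunk id");
        }
        long length = in.getInt() & 0xFFFFFFFFL;
        if (length != 6) {
            throw new IllegalArgumentException("header length is " + length + ", expected 6");
        }
        // <header data> -> <format> <ntrks> <division>
        int format = in.getShort() & 0xFFFF;
        int ntrks = in.getShort() & 0xFFFF;
        int division = in.getShort() & 0xFFFF;
        return new MidiHeader(format, ntrks, division);
    }

    public static void main(String[] args) {
        byte[] header = {'M', 'T', 'h', 'd', 0, 0, 0, 6, 0, 1, 0, 2, 0, 96};
        MidiHeader h = parse(header);
        System.out.println(h.format + " " + h.ntrks + " " + h.division); // prints 1 2 96
    }
}
```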
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF", 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 ", 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
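The fixed offsets in this layout lend themselves to a simple signature-and-size check. The Java sketch below is one possible reading of the table, not an implementation from the WebP documentation: the class WebPHeader and its methods are hypothetical, the tag at offset 12 is assumed to be "VP8" followed by a space, and both sizes are assumed to be little-endian 32-bit integers as in other RIFF files.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; WebPHeader and its methods are hypothetical names.
final class WebPHeader {

    // Read the 4-byte tag that starts at the given offset.
    static String tagAt(byte[] file, int offset) {
        return new String(file, offset, 4, StandardCharsets.US_ASCII);
    }

    // Check the tags at offsets 0, 8, and 12, and check that the two size
    // fields at offsets 4 and 16 do not run past the end of the file.
    static boolean looksLikeWebP(byte[] file) {
        if (file.length < 20) {
            return false; // too short to hold the fixed part of the layout
        }
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        boolean tagsOk = tagAt(file, 0).equals("RIFF")
                && tagAt(file, 8).equals("WEBP")
                && tagAt(file, 12).equals("VP8 ");
        long riffSize = in.getInt(4) & 0xFFFFFFFFL;  // data starting at offset 8
        long vp8Size = in.getInt(16) & 0xFFFFFFFFL;  // data starting at offset 20
        return tagsOk
                && riffSize <= file.length - 8
                && vp8Size <= file.length - 20;
    }

    public static void main(String[] args) throws java.io.IOException {
        byte[] file = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        System.out.println(looksLikeWebP(file));
    }
}
```

The size comparisons use "less than or equal" rather than equality so that trailing padding does not cause a rejection; a stricter validator might require exact agreement.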
A5 JPEG
A51 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last.
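The segment layout above can be walked with a short loop: check the $ff byte, read the type byte, read the two-byte size high byte first, and skip ahead by the size plus the two header bytes it excludes. The Java sketch below illustrates this under stated assumptions. The class JpegSegments is hypothetical, and the walk stops at the first segment of type 0xDA (start of scan), after which a JPEG file holds entropy-coded data rather than further segments; that stopping rule comes from the JPEG standard rather than from the listing above. Markers that carry no size field are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; JpegSegments is a hypothetical name.
final class JpegSegments {

    // Returns the type bytes of the segments that follow the SOI header,
    // up to and including the first start-of-scan segment (type 0xDA).
    static List<Integer> segmentTypes(byte[] file) {
        if (file.length < 2 || (file[0] & 0xFF) != 0xFF || (file[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= file.length) {
            if ((file[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = file[pos + 1] & 0xFF;
            // Size is stored high byte first and includes its own two bytes.
            int size = ((file[pos + 2] & 0xFF) << 8) | (file[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("bad segment size at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                break; // entropy-coded data follows; stop walking segments
            }
            pos += 2 + size; // skip $ff, the type byte, and the sized part
        }
        return types;
    }

    public static void main(String[] args) throws java.io.IOException {
        byte[] file = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        for (int type : segmentTypes(file)) {
            System.out.println("segment type 0x" + Integer.toHexString(type));
        }
    }
}
```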
A52 JPEG/Exif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -gt PNGSignature headerck paletteck
headerck -gt IHDR cksize Width Height BitDepth ColourType
Compressionmethod Filtermethod Interlacemethod
paletteck -gt PLTE cksize entries
entries -gt entry entries
entry -gt red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003], [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
grammar iblm;

options {
    language = Java;   // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }
}
// Grammar rules used to create the parser

// Start symbol: iblm
iblm returns [boolean result]
    : FORM ckSize ILBM
      // Print out ILBM chunk size
      { System.out.println("ID ILBM Size " + $ckSize.value); }
      // Semantic predicate that ensures that at least one BMHD is found
      propertyChunks { $propertyChunks.foundBMHD == true }?
      dataChunks?
      body
      { $result = true; }
    ;

// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
    : ( bmhd { $foundBMHD = true; }
      | cmap
      | camg
      )+
    ;

bmhd
    : BMHD ckSize
      { System.out.println("ID BMHD Size " + $ckSize.value); }
      bitMapHeader
    ;

bitMapHeader
    : width = uWordValue
      height = uWordValue
      xPosition = wordValue
      yPosition = wordValue
      nPanes = uByteValue
      masking = uByteValue
      compression = uByteValue
      reserved1 = uByteValue
      transparentColor = uWordValue
      xAspect = uByteValue
      yAspect = uByteValue
      pageWidth = wordValue
      pageHeight = wordValue
      // Print out values in the bitmap structure
      {
        System.out.println(" Width " + $width.value);
        System.out.println(" Height " + $height.value);
        System.out.println(" X Position " + $xPosition.value);
        System.out.println(" Y Position " + $yPosition.value);
        System.out.println(" Number of Panes " + $nPanes.value);
        System.out.println(" Masking" + $masking.value);
        System.out.println(" Compression" + $compression.value);
        System.out.println(" Reserved1" + $reserved1.value);
        System.out.println(" TransParent Color" +
            $transparentColor.value);
        System.out.println(" X Aspest" + $xAspect.value);
        System.out.println(" Y Aspect" + $yAspect.value);
        System.out.println(" Page Width" + $pageWidth.value);
        System.out.println(" Page Height" + $pageHeight.value);
      }
    ;
cmap returns [long value]
// Initialize the index value i in the cmap function
@init { int i = 0; }
    : CMAP ckSize
      // Print out chunk size
      {
        $value = $ckSize.value;
        System.out.println("ID CMAP Size " + $ckSize.value);
      }
      ( colorRegister
        // Print out color values
        {
          System.out.println(" Color[" + i++ + "] " +
              $colorRegister.redValue +
              " " + $colorRegister.greenValue +
              " " + $colorRegister.blueValue);
        }
      )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    : red = uByteValue green = uByteValue blue = uByteValue
      {
        $redValue = $red.value;
        $greenValue = $green.value;
        $blueValue = $blue.value;
      }
    ;

camg
    : CAMG ckSize longValue
      {
        System.out.println("ID CAMG Size " + $ckSize.value);
        System.out.println(" View Modes 0x" +
            Long.toHexString($longValue.value));
      }
    ;
// The crng and ccrt data chunks
dataChunks
    : crng
    | ccrt
    ;

crng
    : CRNG ckSize
      { System.out.println("ID CRNG Size " + $ckSize.value); }
      cRange
    ;

cRange
    : pad1 = wordValue { $pad1.value == 0 }?
      rate = wordValue
      active = wordValue
      low = uByteValue
      high = uByteValue
    ;

ccrt
    : CCRT ckSize
      { System.out.println("ID CCRT Size " + $ckSize.value); }
      cycleInfo
    ;

cycleInfo
    : direction = wordValue
      start = uByteValue
      end = uByteValue
      seconds = longValue
      microseconds = longValue
      pad = wordValue { $pad.value == 0 }?
    ;

body
    : BODY ckSize
      { System.out.println("ID BODY Size " + $ckSize.value); }
      { $ckSize.value > 0 }? data
      { System.out.println(" Number of bytes " + $data.value); }
      { $data.value == $ckSize.value }?
    ;

// defines the data chunk of the body chunk
data returns [long value]
    : (b+=BYTE)+ { $b.size() > 0 }?
      { $value = $b.size(); }
    ;
// The ckSize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;

// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      {
        $value = (long)
            ( getUByteValue($h1) << 24 |
              getUByteValue($h2) << 16 |
              getUByteValue($l1) << 8 |
              getUByteValue($l2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      // calls getUByteValue on the BYTE token
      { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rules used to create the lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6 An ANTLR Grammar for the ILBM File Format
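The data-type rules at the end of Figure 6 (longValue, uWordValue, wordValue and uByteValue) assemble big-endian values from single bytes. The following plain-Java sketch isolates that arithmetic outside ANTLR. The class and method names are hypothetical. One deliberate difference from the longValue rule is that the sketch widens the high byte to long before shifting: in Java, an int shifted left by 24 becomes negative when the high byte is 0x80 or larger, so casting the finished int expression to long would yield a negative size for chunks of 2 GiB or more.

```java
// Hypothetical helper, not part of the report's grammar.
class BigEndianFields {

    /** Unsigned value (0..255) of one byte, like getUByteValue in the grammar. */
    static int uByte(byte b) {
        return b & 0xFF;
    }

    /** Unsigned 32-bit big-endian value; the high byte is widened before the shift. */
    static long uLong(byte h1, byte h2, byte l1, byte l2) {
        return ((long) uByte(h1) << 24) | (uByte(h2) << 16) | (uByte(l1) << 8) | uByte(l2);
    }

    /** Unsigned 16-bit big-endian value (UWORD). */
    static int uWord(byte b1, byte b2) {
        return (uByte(b1) << 8) | uByte(b2);
    }

    /** Signed 16-bit big-endian value (WORD): the short cast sign-extends bit 15. */
    static int word(byte b1, byte b2) {
        return (short) uWord(b1, b2);
    }

    public static void main(String[] args) {
        // 0x000009FC is the FORM size reported for the sample file in this report.
        System.out.println(uLong((byte) 0x00, (byte) 0x00, (byte) 0x09, (byte) 0xFC)); // 2556
        System.out.println(uWord((byte) 0xFF, (byte) 0xFE));  // 65534
        System.out.println(word((byte) 0xFF, (byte) 0xFE));   // -2
    }
}
```

The pair uWord and word shows why the grammar needs two word rules: the same two bytes denote 65534 as an unsigned width or height and -2 as a signed position.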
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps every byte of the file to exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
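The sizes printed in Figure 8 can be cross-checked by hand. In an IFF FORM, the FORM size counts the four-byte form type (ILBM) plus, for every contained chunk, a four-byte chunk ID, a four-byte ckSize and the chunk data, with one pad byte after data of odd length. For the four chunks reported above this gives 4 + (8 + 20) + (8 + 48) + (8 + 4) + (8 + 2448) = 2556, which matches the reported ILBM size. The Java sketch below performs the same check; the class FormSizeCheck is a hypothetical illustration and not part of the generated parser.

```java
// Hypothetical helper, not part of the report's tooling.
class FormSizeCheck {

    /** Expected FORM ckSize for a form type followed by chunks of the given data sizes. */
    static long formSize(long... chunkSizes) {
        long total = 4;                          // the form type, e.g. "ILBM"
        for (long size : chunkSizes) {
            total += 8 + size + (size & 1);      // ckID + ckSize + data + pad byte if odd
        }
        return total;
    }

    public static void main(String[] args) {
        // BMHD, CMAP, CAMG and BODY sizes as printed in Figure 8
        System.out.println(formSize(20, 48, 4, 2448)); // 2556
    }
}
```

A validator could apply this relation as an additional semantic check on the start rule, on the assumption that the file contains no chunks other than those the grammar recognizes.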
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;   // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default byte order is little endian
    String byteOrder = "II";
}
// Rules used to create the parser

// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue          // Format category
      channels = wordValue           // Number of channels
      samplesPerSec = dWordValue     // Sampling rate
      avgBytesPerSec = dWordValue    // For buffer estimation
      blockAlign = wordValue         // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue      // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

// The gated predicate stops the loop after exactly size bytes
waveData [long size]
    : ( { size > 0 }?=> BYTE { size--; } )+
    ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        } else {
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer
BYTE : . ;
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
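For comparison with the grammar in Figure 9, the same sequence of checks can be written directly in Java. The sketch below is an illustration under stated assumptions rather than the generated parser: the class WavePcmSketch, its methods and its sample() file are hypothetical, the fmt chunk is assumed to be exactly 16 bytes with format tag 1 (PCM), and little-endian byte order is fixed, corresponding to the default byteOrder value "II" in the grammar.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the report's tooling.
class WavePcmSketch {

    static String fourCC(ByteBuffer buf) {
        byte[] id = new byte[4];
        buf.get(id);
        return new String(id, StandardCharsets.US_ASCII);
    }

    static long dWord(ByteBuffer buf) { return buf.getInt() & 0xFFFFFFFFL; }

    static int word(ByteBuffer buf) { return buf.getShort() & 0xFFFF; }

    static void require(boolean ok, String what) {
        if (!ok) throw new IllegalArgumentException("expected " + what);
    }

    /** Returns {channels, samplesPerSec, bitsPerSample, dataBytes} or throws. */
    static long[] validate(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        require(fourCC(buf).equals("RIFF"), "RIFF");
        long riffSize = dWord(buf);
        require(riffSize <= buf.remaining(), "RIFF size within file");
        require(fourCC(buf).equals("WAVE"), "WAVE");
        require(fourCC(buf).equals("fmt "), "fmt ");
        long fmtSize = dWord(buf);
        int formatTag = word(buf);
        int channels = word(buf);
        long samplesPerSec = dWord(buf);
        dWord(buf);                               // avgBytesPerSec, not checked here
        word(buf);                                // blockAlign, not checked here
        int bitsPerSample = word(buf);
        require(fmtSize == 16 && formatTag == 1, "16-byte PCM fmt chunk");
        require(fourCC(buf).equals("data"), "data");
        long dataSize = dWord(buf);
        require(dataSize <= buf.remaining(), "data size within file");
        return new long[] {channels, samplesPerSec, bitsPerSample, dataSize};
    }

    /** A 48-byte mono, 8000 Hz, 16-bit file holding four bytes of silence. */
    static byte[] sample() {
        ByteBuffer b = ByteBuffer.allocate(48).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII)).putInt(40);
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII)).putInt(16);
        b.putShort((short) 1).putShort((short) 1).putInt(8000).putInt(16000);
        b.putShort((short) 2).putShort((short) 16);
        b.put("data".getBytes(StandardCharsets.US_ASCII)).putInt(4).putInt(0);
        return b.array();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(validate(sample())));
    }
}
```

Each require call plays the role of a semantic predicate in the grammar: the structure is read in a fixed order, and an attribute value decides whether parsing may continue.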
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
work of the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack, if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
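To illustrate what a file signature test involves for chunk-based formats, the Java sketch below inspects the first twelve bytes of a file: the group identifier in bytes 0 to 3 and the form type in bytes 8 to 11. It is a simplified stand-in for a magic test, not the file command or the magic file cited above; the class MagicTestSketch, the method identify and the returned labels are assumptions made for illustration, and only the two formats whose grammars appear in this report are covered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the UNIX file command or its magic file.
class MagicTestSketch {

    /** Classifies a file from its first bytes; returns "unknown" when no test matches. */
    static String identify(byte[] head) {
        if (head.length < 12) {
            return "unknown";
        }
        String group = new String(head, 0, 4, StandardCharsets.US_ASCII);
        String type = new String(head, 8, 4, StandardCharsets.US_ASCII);
        if (group.equals("FORM") && type.equals("ILBM")) {
            return "IFF ILBM";
        }
        if (group.equals("RIFF") && type.equals("WAVE")) {
            return "RIFF WAVE";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] head = "FORM\u0000\u0000\u0009\u00fcILBM".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(identify(head)); // IFF ILBM
    }
}
```

Identification of this kind only selects which grammar to apply; validation of the remaining structure is left to the generated parser.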
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D. Baum & G. Noel, The Sims Interchange File Format, 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library, JHOVE2 User's Guide, 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari, Caligari trueSpace file format, Document Version 6, July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley, M., Rankin, S., Conway, E. & Giaretta, D. (2007). The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher, K. & Gruber, R. (2005). PADS: A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft, Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S. Issar and W. Ward, CMU's robust spoken language understanding
system, EUROSPEECH-93, pp. 2147-2150, wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D. E. Knuth, Semantics of Context-Free Grammars, Mathematical Systems Theory,
vol. 2, pp. 127-145, 1968
[Knuth 1971] D. E. Knuth, Semantics of Context-Free Grammars (corrections), Mathematical
Systems Theory, vol. 5, pp. 95-96, 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave, Lightwave 3D Object Files (LWO2), November 9, 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology, "Appendix H: LXO file format", Extensions, modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein, PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison, "EA IFF 85" Standard for Interchange Format Files,
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J. Morrison, "SMUS" IFF Simple Musical Score, Electronic Arts, Inc.,
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr, T. J. & Quong, R. W. (1995). ANTLR: A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer, Class RIFFParser, January 6, 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
31
[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel File
Format: Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003. OpenOffice.org, April 2008.
http://sc.openoffice.org/excelfileformat.pdf
[REWiki 2009] reWiki. IFF. http://rewiki.regengedanken.de/wiki/IFF
[RFC 3072] M. Wildgrube. Structured Data Exchange Format (SDXF). Request for Comments
3072, March 2001. www.ietf.org/rfc/rfc3072.txt
[Shaw et al 1985] Steve Shaw, Jerry Morrison, and Bob Burns. FTXT: IFF Formatted Text. Draft
2.6, Electronic Arts, November 15, 1985. http://fortunaty.net/com/textfiles/computers/ftxt
[Soras 1995] L. de Soras. Official formats of the Graoumf Tracker modules: GT2 v1 - v3 and
old GT modules GTK. 31/08/95. http://www.wotsit.org/list.asp?search=GTK&button=GO%21
[Sparta 1988] Sparta Inc. ANIM: An IFF Format For CEL Animations. 4 May 1988.
http://amigan.1emu.net/reg/ANIM.txt
[Tamir 2005] Giora Tamir. C# RIFF Parser. June 2005.
www.codeproject.com/KB/files/riffparser.aspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan. Attribute Grammars and their Applications.
In: Encyclopedia of Information Science and Technology, Second Edition, Editor: Mehdi
Khosrow-Pour, Information Science Publishing, pp. 268-273, 2008.
www.cs.wright.edu/~tkprasad/papers/Attribute-Grammars.pdf
[Underwood 2009] W. Underwood. Extensions of the UNIX file Command and Magic File for
File Type Identification. PERPOS TR ITTL/CSITD 09-02, Georgia Tech Research Institute,
September 2009.
[Ung 1996] David Ung. Retargetable loader. Honours thesis, Computer Science and Electrical
Engineering, The University of Queensland, 1996.
http://itee.uq.edu.au/~cristina/students/david/david.html
[Ward & Issar 1994] W. Ward & S. Issar. Recent improvements in the CMU spoken language
understanding system. ARPA Human Language Technologies Workshop, Plainsboro, New
Jersey. http://acl.ldc.upenn.edu/H/H94/H94-1039.pdf
[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.
Automata, Languages and Programming, Lecture Notes in Computer Science 14, J. Loeckx (ed.),
Springer-Verlag, Berlin, 1974, pp. 9-16.
[Wilhelm and Mauer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995 (chapter 10, Attribute Grammars).
[Xiph 2010] Xiph.Org Foundation. Vorbis I specification. February 3, 2010.
http://xiph.org/vorbis/doc/Vorbis_I_spec.html
[Zappe 1994] H. Zappe. Oktalyzer. 1994.
http://www.wotsit.org/list.asp?search=okt&button=GO%21
Appendix A: Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
2^(ctOctave-1) * (oneShotHiSamples + repeatHiSamples). The number of samples in the BODY
chunk is
(2^0 + ... + 2^(ctOctave-1)) * (oneShotHiSamples + repeatHiSamples)
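The sample count above can be checked with a few lines of Java. This is only an illustrative sketch: the class name Svx8BodySize and the method bodySamples are hypothetical and are not part of the 8SVX specification or of the grammars in this report.

```java
public class Svx8BodySize {
    /**
     * Number of samples in the BODY chunk:
     * (2^0 + ... + 2^(ctOctave-1)) * (oneShotHiSamples + repeatHiSamples).
     */
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        long highestOctave = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        // Octave k (k = 0 is the highest-frequency octave) holds 2^k times as many samples.
        for (int k = 0; k < ctOctave; k++) {
            total += (1L << k) * highestOctave;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three octaves, 24 one-shot and 8 repeat samples in the highest octave.
        System.out.println(bodySamples(3, 24, 8));
    }
}
```

For three octaves with 24 + 8 samples in the highest octave, the octaves hold 32, 64 and 128 samples, so the BODY holds 224 samples. A validator could compare this value with the cksize of the BODY chunk when sCompression indicates uncompressed data.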
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
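The header chunk production can be read directly from a byte array. The following Java sketch is an assumption-laden illustration, not the recognizer used in this report: the class MidiHeaderSketch and the method parseHeader are hypothetical names, and only the <headerchunk> production is handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MidiHeaderSketch {
    /** Parses the 14-byte header chunk and returns {format, ntrks, division}. */
    static int[] parseHeader(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("file too short for a header chunk");
        }
        ByteBuffer buf = ByteBuffer.wrap(file); // ByteBuffer is big-endian by default
        byte[] id = new byte[4];
        buf.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MThd chunk id");
        }
        long length = buf.getInt() & 0xFFFFFFFFL; // <length of header data>
        if (length != 6) {
            throw new IllegalArgumentException("header length must be 6, found " + length);
        }
        // <header data>: three 16-bit integers
        int format = buf.getShort() & 0xFFFF;
        int ntrks = buf.getShort() & 0xFFFF;
        int division = buf.getShort() & 0xFFFF;
        return new int[] { format, ntrks, division };
    }

    public static void main(String[] args) {
        byte[] sample = { 'M', 'T', 'h', 'd', 0, 0, 0, 6, 0, 1, 0, 2, 0, 96 };
        int[] h = parseHeader(sample);
        System.out.println("format=" + h[0] + " ntrks=" + h[1] + " division=" + h[2]);
    }
}
```

The check that the length equals 6 plays the role of a semantic predicate on the <length of header data> attribute.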
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version (AIFC) of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[]
CommentsChunk -> "COMT" cksize numcomments Comments[]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
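The offsets in the layout above translate into a short header check. The Java sketch below is illustrative only: the class WebPHeaderCheck and the method vp8PayloadSize are hypothetical names, the code assumes the simple layout listed above (a single "VP8 " chunk directly after the form type), and it relies on the RIFF convention that sizes are stored in little-endian byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class WebPHeaderCheck {
    /** Reads the 4-character tag stored at the given offset. */
    static String tag(byte[] file, int offset) {
        return new String(file, offset, 4, StandardCharsets.US_ASCII);
    }

    /** Validates the header fields and returns the size of the VP8 image data. */
    static long vp8PayloadSize(byte[] file) {
        if (file.length < 20) {
            throw new IllegalArgumentException("file too short for a WebP header");
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (!tag(file, 0).equals("RIFF")) {
            throw new IllegalArgumentException("offset 0: expected RIFF");
        }
        long riffSize = buf.getInt(4) & 0xFFFFFFFFL; // size of data starting at offset 8
        if (!tag(file, 8).equals("WEBP")) {
            throw new IllegalArgumentException("offset 8: expected WEBP");
        }
        if (!tag(file, 12).equals("VP8 ")) {
            throw new IllegalArgumentException("offset 12: expected VP8 tag");
        }
        long vp8Size = buf.getInt(16) & 0xFFFFFFFFL; // size of data starting at offset 20
        if (8 + riffSize > file.length || 20 + vp8Size > file.length) {
            throw new IllegalArgumentException("declared sizes exceed the file length");
        }
        return vp8Size;
    }

    public static void main(String[] args) {
        byte[] sample = { 'R', 'I', 'F', 'F', 16, 0, 0, 0, 'W', 'E', 'B', 'P',
                          'V', 'P', '8', ' ', 4, 0, 0, 0, 1, 2, 3, 4 };
        System.out.println(vp8PayloadSize(sample));
    }
}
```

The two size comparisons correspond to the kind of attribute conditions that a binary file grammar would attach to the cksize fields.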
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
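The segment header described above is enough to step from one segment to the next. The following Java sketch shows the arithmetic under the stated layout; the class JpegSegments and its methods are hypothetical names introduced only for this illustration, and entropy-coded data after an SOS segment is not handled.

```java
public class JpegSegments {
    /** Size field (sh, sl) of the segment whose $ff byte is at 'offset'; high byte first. */
    static int segmentSize(byte[] data, int offset) {
        if (offset + 4 > data.length || (data[offset] & 0xFF) != 0xFF) {
            throw new IllegalArgumentException("no segment header at offset " + offset);
        }
        int size = ((data[offset + 2] & 0xFF) << 8) | (data[offset + 3] & 0xFF);
        if (size < 2) {
            throw new IllegalArgumentException("segment size must include its own two bytes");
        }
        return size;
    }

    /** Offset of the marker that follows the segment starting at 'offset'. */
    static int nextSegmentOffset(byte[] data, int offset) {
        // The size excludes the $ff byte and the type byte, hence the extra 2.
        return offset + 2 + segmentSize(data, offset);
    }

    public static void main(String[] args) {
        byte[] sample = {
            (byte) 0xFF, (byte) 0xD8,             // SOI
            (byte) 0xFF, (byte) 0xE0, 0x00, 0x04, // segment header, size = 4
            0x11, 0x22,                           // two bytes of segment data
            (byte) 0xFF, (byte) 0xD9              // EOI
        };
        System.out.println(nextSegmentOffset(sample, 2));
    }
}
```

In the sample, the segment at offset 2 declares a size of 4, so the next marker (EOI) is found at offset 2 + 2 + 4 = 8.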
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
In a PNG chunk the four-byte length precedes the chunk type, and a CRC follows the chunk data.
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis's The Sims [Baum & Noel 2002].
        bitMapHeader
    ;

bitMapHeader
    :   width = uWordValue
        height = uWordValue
        xPosition = wordValue
        yPosition = wordValue
        nPanes = uByteValue
        masking = uByteValue
        compression = uByteValue
        reserved1 = uByteValue
        transparentColor = uWordValue
        xAspect = uByteValue
        yAspect = uByteValue
        pageWidth = wordValue
        pageHeight = wordValue
        {
            // Print out values in the bitmap structure
            System.out.println("  Width: " + $width.value);
            System.out.println("  Height: " + $height.value);
            System.out.println("  X Position: " + $xPosition.value);
            System.out.println("  Y Position: " + $yPosition.value);
            System.out.println("  Number of Panes: " + $nPanes.value);
            System.out.println("  Masking:" + $masking.value);
            System.out.println("  Compression:" + $compression.value);
            System.out.println("  Reserved1:" + $reserved1.value);
            System.out.println("  TransParent Color:" +
                $transparentColor.value);
            System.out.println("  X Aspest:" + $xAspect.value);
            System.out.println("  Y Aspect:" + $yAspect.value);
            System.out.println("  Page Width:" + $pageWidth.value);
            System.out.println("  Page Height:" + $pageHeight.value);
        }
    ;
cmap returns [long value]
    // Initialize the index value i in the cmap function
    @init { int i = 0; }
    :   CMAP ckSize
        {
            // Print out chunk size
            $value = $ckSize.value;
            System.out.println("ID: CMAP Size: " + $ckSize.value);
        }
        (   colorRegister
            {
                // Print out color values
                System.out.println("  Color[" + i++ + "]: " +
                    $colorRegister.redValue + " " +
                    $colorRegister.greenValue + " " +
                    $colorRegister.blueValue);
            }
        )+
    ;

colorRegister returns [int redValue, int greenValue, int blueValue]
    :   red = uByteValue green = uByteValue blue = uByteValue
        {
            $redValue = $red.value;
            $greenValue = $green.value;
            $blueValue = $blue.value;
        }
    ;

camg
    :   CAMG ckSize longValue
        {
            System.out.println("ID: CAMG Size: " + $ckSize.value);
            System.out.println("  View Modes: 0x" +
                Long.toHexString($longValue.value));
        }
    ;
// The crng and ccrt data chunks
dataChunks
    :   crng
    |   ccrt
    ;

crng
    :   CRNG ckSize
        { System.out.println("ID: CRNG Size: " + $ckSize.value); }
        cRange
    ;

cRange
    :   pad1 = wordValue {$pad1.value == 0}?
        rate = wordValue
        active = wordValue
        low = uByteValue
        high = uByteValue
    ;

ccrt
    :   CCRT ckSize
        { System.out.println("ID: CCRT Size: " + $ckSize.value); }
        cycleInfo
    ;

cycleInfo
    :   direction = wordValue
        start = uByteValue
        end = uByteValue
        seconds = longValue
        microseconds = longValue
        pad = wordValue {$pad.value == 0}?
    ;

body
    :   BODY ckSize
        { System.out.println("ID: BODY Size: " + $ckSize.value); }
        {$ckSize.value > 0}? data
        { System.out.println("  Number of bytes: " + $data.value); }
        {$data.value == $ckSize.value}?
    ;

// defines the data chunk of the body chunk
data returns [long value]
    :   (b+=BYTE)+ {$b.size() > 0}?
        { $value = $b.size(); }
    ;
// The cksize is the value of a long integer
ckSize returns [long value]
    :   longValue
        { $value = $longValue.value; }
    ;

// defines the longValue data type in terms of BYTE terminals
// (big-endian: high-order byte first)
longValue returns [long value] throws UnsupportedEncodingException
    :   h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
        {
            $value = (long)
                ( getUByteValue($h1) << 24 |
                  getUByteValue($h2) << 16 |
                  getUByteValue($l1) << 8 |
                  getUByteValue($l2));
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned word data type
uWordValue returns [int value]
    :   b1 = BYTE b2 = BYTE
        {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2));
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    :   b1 = BYTE b2 = BYTE
        {
            $value = (short)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2));
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the unsigned byte data type
uByteValue returns [int value]
    :   BYTE
        // calls the getUByteValue from a BYTE token
        { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6: An ANTLR Grammar for the ILBM File Format
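The data type rules of the grammar assemble multi-byte values from single BYTE tokens, high-order byte first. The same arithmetic can be sketched in plain Java, outside of ANTLR. The class BigEndianFields and its method names are hypothetical and are used only for this illustration; they are not part of the generated parser.

```java
public class BigEndianFields {
    /** Unsigned value (0..255) of one byte. */
    static int uByte(byte b) {
        return b & 0xFF;
    }

    /** Unsigned 16-bit word, high byte first (compare the uWordValue rule). */
    static int uWord(byte[] d, int off) {
        return (uByte(d[off]) << 8) | uByte(d[off + 1]);
    }

    /** Signed 16-bit word, high byte first (compare the wordValue rule). */
    static int word(byte[] d, int off) {
        return (short) uWord(d, off);
    }

    /** Signed 32-bit long, high byte first (compare the longValue rule). */
    static long longValue(byte[] d, int off) {
        return (long) ((uByte(d[off]) << 24) | (uByte(d[off + 1]) << 16)
                | (uByte(d[off + 2]) << 8) | uByte(d[off + 3]));
    }

    public static void main(String[] args) {
        byte[] ckSize = { 0x00, 0x00, 0x09, (byte) 0xFC };
        byte[] w = { (byte) 0xFF, (byte) 0xFE };
        System.out.println(longValue(ckSize, 0));
        System.out.println(uWord(w, 0) + " " + word(w, 0));
    }
}
```

The bytes 00 00 09 FC decode to 2556, and the bytes FF FE decode to 65534 as an unsigned word but to -2 as a signed word, which is why the grammar keeps separate uWordValue and wordValue rules.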
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the iblmLexer and
iblmParser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    // Path of the ILBM file to be validated
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps every byte to exactly one character, so each byte of the
        // file becomes one input symbol for the lexer
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE: " + filePath);
        // iblm is the start rule of the grammar
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7: A Java Application for Parsing an ILBM File
FILE: C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID: ILBM Size: 2556
ID: BMHD Size: 20
  Width: 200
  Height: 144
  X Position: 0
  Y Position: 0
  Number of Panes: 6
  Masking:0
  Compression:1
  Reserved1:0
  TransParent Color:0
  X Aspest:3
  Y Aspect:3
  Page Width:200
  Page Height:144
ID: CMAP Size: 48
  Color[0]: 0 0 0
  Color[1]: 0 128 128
  Color[2]: 0 255 255
  Color[3]: 255 0 255
  Color[4]: 0 255 0
  Color[5]: 255 0 0
  Color[6]: 0 0 255
  Color[7]: 255 255 0
  Color[8]: 128 128 128
  Color[9]: 192 192 192
  Color[10]: 128 0 0
  Color[11]: 0 128 0
  Color[12]: 128 128 0
  Color[13]: 0 0 128
  Color[14]: 128 0 128
  Color[15]: 255 255 255
ID: CAMG Size: 4
  View Modes: 0x800
ID: BODY Size: 2448
  Number of bytes: 2448
Successfully parsed = true
Figure 8: Output of the ILBM Parser Applied to the file BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer
// file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}
// Rules used to create the Parser
// start symbol waveform_pcm
waveform_pcm returns [boolean result]
    :   ckId = fourCC {$ckId.value.equals("RIFF")}?
        ckSize = dWordValue
        formType = fourCC {$formType.value.equals("WAVE")}?
        formatSubChunk
        dataSubChunk
        { $result = true; }
    ;

fourCC returns [String value]
    :   BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    :   subChunkId = fourCC {$subChunkId.value.equals("fmt ")}?
        ckSize = dWordValue
        commonFields
        pcmSpecificField
    ;

commonFields
    :   formatTag = wordValue           // Format category
        channels = wordValue            // Number of channels
        samplesPerSec = dWordValue      // Sampling rate
        avgBytesPerSec = dWordValue     // For buffer estimation
        blockAlign = wordValue          // Data block size
    ;

pcmSpecificField
    :   bitsPerSample = wordValue       // Sample size
    ;

dataSubChunk
    :   subChunkId = fourCC {$subChunkId.value.equals("data")}?
        ckSize = dWordValue
        waveData[$ckSize.value]
    ;

// Consumes exactly size bytes; the gated predicate stops the loop when the
// count reaches zero (a test of size >= 0 would accept one byte too many)
waveData [long size]
    :   ( {size > 0}?=> BYTE {size--;} )+
    ;
// defines the unsigned 32-bit (DWORD) data type in terms of BYTE terminals;
// the high-order byte is widened to long before shifting so that values of
// 2^31 and above do not become negative
dWordValue returns [long value] throws UnsupportedEncodingException
    :   b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
        {
            if (byteOrder.equals("II")) {
                $value = (long)
                    ( getUByteValue($b1) |
                      getUByteValue($b2) << 8 |
                      getUByteValue($b3) << 16 |
                      (long) getUByteValue($b4) << 24 );
            } else {
                $value = (long)
                    ( (long) getUByteValue($b1) << 24 |
                      getUByteValue($b2) << 16 |
                      getUByteValue($b3) << 8 |
                      getUByteValue($b4) );
            }
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the 16-bit word data type
wordValue returns [int value]
    :   b1 = BYTE b2 = BYTE
        {
            if (byteOrder.equals("II")) {
                $value = (int)
                    ( getUByteValue($b1) |
                      getUByteValue($b2) << 8 );
            } else {
                $value = (int)
                    ( getUByteValue($b1) << 8 |
                      getUByteValue($b2) );
            }
        }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the Lexer
BYTE : . ;
Figure 9: ANTLR Grammar for the RIFF WAVEFORM PCM File Format
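A RIFF DWORD is an unsigned 32-bit value stored low-order byte first. When the four bytes are combined with Java int arithmetic and cast to long only afterwards, any value of 2^31 or more comes out negative. The sketch below contrasts the two ways of combining the bytes. The class DWordDecode and its methods are hypothetical names chosen for this illustration.

```java
public class DWordDecode {
    /** Little-endian DWORD with each byte widened to long before shifting. */
    static long dWordLE(int b1, int b2, int b3, int b4) {
        return (long) b1 | ((long) b2 << 8) | ((long) b3 << 16) | ((long) b4 << 24);
    }

    /** Same bytes combined in int arithmetic and widened afterwards (shown for contrast). */
    static long dWordLEIntArithmetic(int b1, int b2, int b3, int b4) {
        return (long) (b1 | b2 << 8 | b3 << 16 | b4 << 24);
    }

    public static void main(String[] args) {
        // Bytes 00 00 00 80 encode 2^31 in little-endian order.
        System.out.println(dWordLE(0x00, 0x00, 0x00, 0x80));
        System.out.println(dWordLEIntArithmetic(0x00, 0x00, 0x00, 0x80));
    }
}
```

For chunk sizes below 2 GB both methods agree, so the difference only matters for very large files or for corrupt size fields, which are exactly the cases a validator should report correctly.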
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
JHOVE project's work in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A. V. Aho. Indexed Grammars - An Extension of Context-free Grammars. Journal
of the ACM, Vol. 15, Issue 4, Oct. 1968.
[Alias-wavefront 1999] Maya File Formats. Maya 2.0, 1999.
http://caad.arch.ethz.ch/info/maya/manual
[Amigan 2011] Amigan Software. IFF FORM and Chunk Registry.
http://amigan.1emu.net/reg/iff.html
[Anisimovsky 2002a] Valery V. Anisimovsky. Electronic Arts Audio File Formats Description.
May 1, 2002. www.extractor.ru/articles/electronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V. Anisimovsky. Electronic Arts New Audio File Formats
Description. May 1, 2002. www.wotsit.org/list.asp?search=electronic+arts&button=GO%21
[Apple 1989] Audio Interchange File Format: "AIFF". A Standard for Sampled Sound Files,
Version 1.3. Apple Computer, Inc., January 1989.
www.aesnashville.org/PDFs/AIFF%20Specs.pdf
[Apple 1991] AIFF-C. Apple Computer, Inc. Draft 08/26/91.
http://classic-web.archive.org/web/20071219035740/http://www.cnpbagwell.com/aiff-c.txt
[Apple 2005] Apple Core Audio Format Specification. 2005b.
http://developer.apple.com/library/mac/#documentation/MusicAudio/Reference/CAFSpec/CAF_spec/CAF_spec.html#//apple_ref/doc/uid/TP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification.
http://developer.apple.com/library/mac/#documentation/QuickTime/QTFF/QTFFPreface/qtffPreface.html
[ASF 2001] ASF format version 1.0 reconstruction. January 2001.
http://avifile.sourceforge.net/asf-1.0.htm
[Autodesk 1997] Autodesk Ltd. 3D Studio File Format (3ds). Document Revision 0.93, January
1997. www.dcs.ed.ac.uk/home/mxr/gfx/3d/3DS.spec
[Autodesk 2009] Autodesk. DXF Reference. January 2009.
http://usa.autodesk.com/adsk/servlet/item?linkID=10809853&id=12272454&siteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^{0} + \cdots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$$
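The sample count described above is a geometric sum, so it can be checked against the closed form (2^ctOctave - 1) times (oneShotHiSamples + repeatHiSamples). The following is a minimal sketch of that arithmetic; the class name, method names, and the range guard on ctOctave are illustrative assumptions and are not part of the 8SVX specification.

```java
// Hypothetical helper: checks the BODY sample count of an 8SVX file
// against the per-octave description given in the text.
final class Svx8BodySize {

    /** Samples in octave k, where k = 0 is the highest-frequency octave. */
    static long samplesInOctave(int k, long oneShotHiSamples, long repeatHiSamples) {
        return (1L << k) * (oneShotHiSamples + repeatHiSamples);
    }

    /** Total samples in BODY, using the closed form of the geometric sum. */
    static long totalSamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        // Guard chosen for this sketch only, to keep the shift well inside a long.
        if (ctOctave < 1 || ctOctave > 30) {
            throw new IllegalArgumentException("ctOctave out of range for this sketch: " + ctOctave);
        }
        return ((1L << ctOctave) - 1) * (oneShotHiSamples + repeatHiSamples);
    }

    public static void main(String[] args) {
        int ctOctave = 3;
        long oneShot = 10;
        long repeat = 6;
        long sum = 0;
        for (int k = 0; k < ctOctave; k++) {
            sum += samplesInOctave(k, oneShot, repeat);
        }
        // Both values should agree: the loop and the closed form compute the same sum.
        System.out.println("summed = " + sum + ", closed form = " + totalSamples(ctOctave, oneShot, repeat));
    }
}
```

A validator could compare this total with the cksize of the BODY chunk when the samples are uncompressed.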
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks> | <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
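As an illustration of the header-chunk production, the sketch below reads the "MThd" tag, checks that the length of the header data is 6, and extracts the three 16-bit integers. It assumes the big-endian byte order used by Standard MIDI Files; the class and method names are hypothetical and the input in main is a synthetic 14-byte header, not a real file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: validates the header chunk of a Standard MIDI File.
final class MidiHeaderSketch {

    /** Returns {format, ntrks, division} or throws if the header chunk is malformed. */
    static int[] parseHeader(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("too short for a header chunk");
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN);
        byte[] id = new byte[4];
        buf.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.ISO_8859_1))) {
            throw new IllegalArgumentException("missing MThd tag");
        }
        int length = buf.getInt();
        if (length != 6) {
            throw new IllegalArgumentException("length of header data must be 6, found " + length);
        }
        // Mask each 16-bit field so it is read as an unsigned value.
        int format = buf.getShort() & 0xFFFF;
        int ntrks = buf.getShort() & 0xFFFF;
        int division = buf.getShort() & 0xFFFF;
        return new int[] {format, ntrks, division};
    }

    public static void main(String[] args) {
        byte[] header = {'M', 'T', 'h', 'd', 0, 0, 0, 6, 0, 1, 0, 2, 0, 96};
        int[] fields = parseHeader(header);
        System.out.println("format=" + fields[0] + " ntrks=" + fields[1] + " division=" + fields[2]);
    }
}
```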
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[]
CommentsChunk -> "COMT" cksize numcomments comments[]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
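The FORM and local-chunk structure used by the IFF and AIFF grammars can be walked with a few lines of code. The sketch below lists the local chunks of a FORM. It assumes the EA IFF 85 conventions of big-endian chunk sizes and a pad byte after odd-sized chunk data, which the grammar above does not state explicitly; the class name, method names, and the synthetic sample file are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the chunk IDs and sizes found inside one FORM.
final class IffChunkLister {

    static String tag(byte[] file, int offset) {
        return new String(file, offset, 4, StandardCharsets.ISO_8859_1);
    }

    static List<String> listLocalChunks(byte[] file) {
        if (file.length < 12 || !"FORM".equals(tag(file, 0))) {
            throw new IllegalArgumentException("not a FORM file");
        }
        ByteBuffer buf = ByteBuffer.wrap(file); // ByteBuffer is big-endian by default
        long formSize = buf.getInt(4) & 0xFFFFFFFFL;
        long end = Math.min(file.length, 8 + formSize);
        List<String> out = new ArrayList<>();
        out.add("FORM " + tag(file, 8));
        long pos = 12;
        while (pos + 8 <= end) {
            String id = tag(file, (int) pos);
            long size = buf.getInt((int) pos + 4) & 0xFFFFFFFFL;
            if (pos + 8 + size > end) {
                throw new IllegalArgumentException("chunk " + id + " overruns the FORM");
            }
            out.add(id + " " + size);
            pos += 8 + size + (size & 1); // skip the pad byte after odd-sized data
        }
        return out;
    }

    /** A synthetic 34-byte file: FORM AIFF with a 3-byte NAME and a 2-byte SSND. */
    static byte[] sample() {
        return new byte[] {
            'F', 'O', 'R', 'M', 0, 0, 0, 26, 'A', 'I', 'F', 'F',
            'N', 'A', 'M', 'E', 0, 0, 0, 3, 'a', 'b', 'c', 0,
            'S', 'S', 'N', 'D', 0, 0, 0, 2, 1, 2
        };
    }

    public static void main(String[] args) {
        for (String line : listLocalChunks(sample())) {
            System.out.println(line);
        }
    }
}
```

The sample is deliberately not a valid AIFF file (it has no COMM chunk); it only exercises the chunk framing.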
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
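The offsets in this layout translate directly into a header check. The following sketch is one possible reading of the table, assuming RIFF's little-endian size fields and a four-byte tag "VP8 " with a trailing space; the class name, the sample builder, and the choice to accept sizes that are no larger than the remaining file length are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks the fixed tags and size fields of a WebP header.
final class WebPHeaderSketch {

    static String tag(byte[] file, int offset) {
        return new String(file, offset, 4, StandardCharsets.ISO_8859_1);
    }

    static boolean looksLikeWebP(byte[] file) {
        if (file.length < 20) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        long riffSize = buf.getInt(4) & 0xFFFFFFFFL;   // data starting at offset 8
        long vp8Size = buf.getInt(16) & 0xFFFFFFFFL;   // image data starting at offset 20
        return "RIFF".equals(tag(file, 0))
            && "WEBP".equals(tag(file, 8))
            && "VP8 ".equals(tag(file, 12))
            && riffSize <= file.length - 8
            && vp8Size <= file.length - 20;
    }

    /** A synthetic 24-byte file with four zero bytes standing in for image data. */
    static byte[] sample() {
        return ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN)
            .put("RIFF".getBytes(StandardCharsets.ISO_8859_1))
            .putInt(16)
            .put("WEBP".getBytes(StandardCharsets.ISO_8859_1))
            .put("VP8 ".getBytes(StandardCharsets.ISO_8859_1))
            .putInt(4)
            .put(new byte[4])
            .array();
    }

    public static void main(String[] args) {
        System.out.println("sample accepted: " + looksLikeWebP(sample()));
    }
}
```

This checks only the container framing; it does not inspect the VP8 image data.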
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker, see below
- any number of segments (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
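The segment format above can be exercised with a short walker that starts after the SOI marker and hops from segment header to segment header. This is a sketch under simplifying assumptions: it stops at the start-of-scan segment (type $da), because entropy-coded data follows it, and it does not handle markers that carry no size field. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the segments of a JPEG/JFIF file up to start-of-scan.
final class JpegSegmentWalker {

    static List<String> listSegments(byte[] f) {
        if (f.length < 4 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<String> out = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= f.length) {
            if ((f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                break; // EOI trailer
            }
            // High byte first, low byte last; the size counts its own two bytes.
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("segment size below 2 at offset " + pos);
            }
            out.add(String.format("type=0x%02X size=%d", type, size));
            if (type == 0xDA) {
                break; // start of scan: compressed data follows, not segments
            }
            pos += 2 + size; // skip $ff, the type byte, and the sized part
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] tiny = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0, 4, 'J', 'F',
                       (byte) 0xFF, (byte) 0xD9};
        System.out.println(listSegments(tiny));
    }
}
```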
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entries -> entry
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
  : red = uByteValue green = uByteValue blue = uByteValue
    {
      $redValue = $red.value;
      $greenValue = $green.value;
      $blueValue = $blue.value;
    }
  ;
camg
  : CAMG ckSize longValue
    {
      System.out.println("ID CAMG Size " + $ckSize.value);
      System.out.println(" View Modes 0x" +
          Long.toHexString($longValue.value));
    }
  ;
// The crng and ccrt data chunks
dataChunks
  : crng
  | ccrt
  ;
crng
  : CRNG ckSize
    { System.out.println("ID CRNG Size " + $ckSize.value); }
    cRange
  ;
cRange
  : pad1 = wordValue { $pad1.value == 0 }?
    rate = wordValue
    active = wordValue
    low = uByteValue
    high = uByteValue
  ;
ccrt
  : CCRT ckSize
    { System.out.println("ID CCRT Size " + $ckSize.value); }
    cycleInfo
  ;
cycleInfo
  : direction = wordValue
    start = uByteValue
    end = uByteValue
    seconds = longValue
    microseconds = longValue
    pad = wordValue { $pad.value == 0 }?
  ;
body
  : BODY ckSize
    { System.out.println("ID BODY Size " + $ckSize.value); }
    { $ckSize.value > 0 }? data
    { System.out.println(" Number of bytes " + $data.value); }
    { $data.value == $ckSize.value }?
  ;
// defines the data chunk of the body chunk
data returns [long value]
  : (b+=BYTE)+ { $b.size() > 0 }?
    { $value = $b.size(); }
  ;
// The cksize is the value of a long integer
ckSize returns [long value]
  : longValue
    { $value = $longValue.value; }
  ;
// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
  : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
    {
      $value = (long)
        ( getUByteValue($h1) << 24 |
          getUByteValue($h2) << 16 |
          getUByteValue($l1) << 8 |
          getUByteValue($l2));
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the unsigned word data type
uWordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (int)
        ( getUByteValue($b1) << 8 |
          getUByteValue($b2));
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      $value = (short)
        ( getUByteValue($b1) << 8 |
          getUByteValue($b2));
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the unsigned byte data type
uByteValue returns [int value]
  : BYTE
    // calls the getUByteValue from a BYTE token
    { $value = (int) getUByteValue($BYTE); }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 6 An ANTLR Grammar for the ILBM File Format
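The uWordValue and wordValue rules in Figure 6 combine the same two bytes, high byte first, and differ only in the final cast: casting to short sign-extends the 16-bit pattern. The sketch below isolates that difference outside of ANTLR. The class and method names are illustrative and do not appear in the grammar.

```java
// Hypothetical helper: big-endian 16-bit decoding, unsigned versus signed.
final class BigEndianWords {

    /** Combines two byte values (0..255) high byte first, as uWordValue does. */
    static int unsignedWord(int high, int low) {
        return ((high & 0xFF) << 8) | (low & 0xFF);
    }

    /** Same bit pattern, but the (short) cast sign-extends it, as wordValue does. */
    static int signedWord(int high, int low) {
        return (short) unsignedWord(high, low);
    }

    public static void main(String[] args) {
        // 0x00C8 has the top bit clear, so both readings agree.
        System.out.println(unsignedWord(0x00, 0xC8) + " " + signedWord(0x00, 0xC8));
        // 0xFFFE has the top bit set, so the signed reading is negative.
        System.out.println(unsignedWord(0xFF, 0xFE) + " " + signedWord(0xFF, 0xFE));
    }
}
```

This is why fields such as the pad words, which are compared with zero, can use either rule, while fields that may exceed 32767 need the unsigned rule.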
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the ilbmlexer and
ilbmparser Java programs. The Java application shown in Figure 7 uses these programs to
validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that each byte becomes exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
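The application reads the binary file through a character stream, which works because ISO-8859-1 maps each byte value 0 to 255 to the character with the same code point. The getUByteValue helper in the grammar can therefore recover the byte by masking the character with 0x00ff. The sketch below demonstrates that round trip with the standard library only; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows that ISO-8859-1 decoding preserves every byte value.
final class Latin1RoundTrip {

    /** Decodes raw bytes as ISO-8859-1 and recovers each byte value from its char. */
    static int[] decodeBytes(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        int[] values = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            values[i] = text.charAt(i) & 0x00ff; // same mask as getUByteValue
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x7F, (byte) 0x80, (byte) 0xFF};
        for (int v : decodeBytes(raw)) {
            System.out.println(v);
        }
    }
}
```

With a multi-byte encoding such as UTF-8, bytes above 0x7F would not map one-to-one to characters, so the choice of ISO-8859-1 matters for this technique.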
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLK.IFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
  language = Java;  // The programming language of the output parser
  TokenLabelType = CommonToken;
}

// The text contained in header is placed at the top of the parser file
@header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text contained in lexer::header is placed at the top of the lexer file
@lexer::header {
  package gtri.org.grammars.chunk;
  import java.io.UnsupportedEncodingException;
}

// The text in the members section is placed inside the parser class
@members {
  // getUByteValue inputs the token of a Byte terminal and returns an
  // integer value
  public int getUByteValue(Token tok) throws UnsupportedEncodingException {
    int intValue = 0;
    char charValue = tok.getText().charAt(0);
    intValue = (int) charValue;
    intValue = intValue & 0x00ff;
    return intValue;
  }
  // Default ByteOrder to Little Endian
  String byteOrder = "II";
}
// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
  : ckId = fourCC { $ckId.value.equals("RIFF") }?
    ckSize = dWordValue
    formType = fourCC { $formType.value.equals("WAVE") }?
    formatSubChunk
    dataSubChunk
    { $result = true; }
  ;
fourCC returns [String value]
  : BYTE BYTE BYTE BYTE { $value = $text; }
  ;
formatSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
    ckSize = dWordValue
    commonFields
    pcmSpecificField
  ;
commonFields
  : formatTag = wordValue          // Format category
    channels = wordValue           // Number of channels
    samplesPerSec = dWordValue     // Sampling rate
    avgBytesPerSec = dWordValue    // For buffer estimation
    blockAlign = wordValue         // Data block size
  ;
pcmSpecificField
  : bitsPerSample = wordValue      // Sample size
  ;
dataSubChunk
  : subChunkId = fourCC { $subChunkId.value.equals("data") }?
    ckSize = dWordValue
    waveData[$ckSize.value]
  ;
waveData [long size]
  : ( { size >= 0 }?=> BYTE { size--; } )+
  ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
  : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (long)
          ( getUByteValue($b1) |
            getUByteValue($b2) << 8 |
            getUByteValue($b3) << 16 |
            getUByteValue($b4) << 24 );
      } else {
        $value = (long)
          ( getUByteValue($b1) << 24 |
            getUByteValue($b2) << 16 |
            getUByteValue($b3) << 8 |
            getUByteValue($b4) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// defines the signed word data type
wordValue returns [int value]
  : b1 = BYTE b2 = BYTE
    {
      if (byteOrder.equals("II")) {
        $value = (int)
          ( getUByteValue($b1) |
            getUByteValue($b2) << 8 );
      } else {
        $value = (int)
          ( getUByteValue($b1) << 8 |
            getUByteValue($b2) );
      }
    }
  ;
  catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
  }
  catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
  }
// Terminal rule used to create the Lexer
BYTE : . ;
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
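The dWordValue rule selects between two byte orders using the byteOrder member, with "II" meaning little-endian. The sketch below shows the same selection using java.nio.ByteBuffer, and masks the result so that a value with the top bit set is read as an unsigned 32-bit quantity. In Java, shifting an int left by 24 can produce a negative number before it is widened to long, so a validator that expects sizes above 2^31 - 1 may want the masked form. The class and method names are hypothetical, and treating every value other than "II" as big-endian mirrors the else branch of the rule.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: decodes a 32-bit unsigned value in either byte order.
final class DWordDecoder {

    /** Reads four bytes at offset; "II" selects little-endian, anything else big-endian. */
    static long dWord(byte[] bytes, int offset, String byteOrder) {
        ByteOrder order = "II".equals(byteOrder) ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
        int raw = ByteBuffer.wrap(bytes).order(order).getInt(offset);
        return raw & 0xFFFFFFFFL; // widen without sign extension
    }

    public static void main(String[] args) {
        byte[] littleEndian = {0x10, 0x00, 0x00, (byte) 0x80};
        byte[] bigEndian = {0x00, 0x00, 0x09, (byte) 0x90};
        System.out.println(dWord(littleEndian, 0, "II"));
        System.out.println(dWord(bigEndian, 0, "MM"));
    }
}
```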
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information The EAST description is used to interpret and provide access to information in
binary and text files
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The parser generator we have found that comes closest to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISO/IEC 15948:2003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression Fixedvolume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
Fixedvolume → UBYTE
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last, with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
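The sample count above could be used by a validator to check the size of the BODY chunk against the header fields. The following is a minimal sketch in Java; the class and method names are hypothetical, and it assumes one byte per sample with no compression, which the text above does not state.

```java
/** Illustrative only: expected number of samples in an 8SVX BODY chunk. */
final class Voice8BodySketch {

    /**
     * Sums the samples of each octave: the highest octave holds
     * (oneShotHiSamples + repeatHiSamples) samples and each lower octave
     * holds twice as many as the one above it.
     */
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        if (ctOctave < 0 || ctOctave > 30) {
            // Range limit chosen for this sketch to keep the arithmetic well within a long.
            throw new IllegalArgumentException("ctOctave out of range for this sketch: " + ctOctave);
        }
        long highest = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        long factor = 1;                       // 2^octave
        for (int octave = 0; octave < ctOctave; octave++) {
            total += factor * highest;
            factor *= 2;
        }
        return total;                          // equals (2^ctOctave - 1) * highest
    }
}
```

For example, three octaves with 100 one-shot and 50 repeat samples in the highest octave give (1 + 2 + 4) x 150 = 1050 samples.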
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
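The header chunk production can be checked by hand in a few lines. The following is a minimal sketch in Java, not a generated parser; the class and method names are hypothetical. It relies on the Standard MIDI File convention that multi-byte integers are stored most significant byte first, which is also the default byte order of java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: recognizes the header chunk of a Standard MIDI File. */
final class MidiHeaderSketch {

    /** Returns {format, ntrks, division}; throws if the header chunk is malformed. */
    static int[] parseHeader(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("too short for a header chunk");
        }
        ByteBuffer in = ByteBuffer.wrap(file);          // big-endian by default
        byte[] id = new byte[4];
        in.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MThd");
        }
        long length = in.getInt() & 0xFFFFFFFFL;        // <length of header data>
        if (length != 6) {
            throw new IllegalArgumentException("header data length is " + length + ", expected 6");
        }
        int format = in.getShort() & 0xFFFF;            // three 16-bit integers
        int ntrks = in.getShort() & 0xFFFF;
        int division = in.getShort() & 0xFFFF;
        return new int[] {format, ntrks, division};
    }
}
```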
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFF-C, of the Audio Interchange
File Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
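Every production above has the same outer shape: a four-character identifier, a cksize, and cksize bytes of data. The following minimal sketch in Java walks the local chunks of a FORM AIFF file and lists their identifiers. It is an illustration, not the recognizer generated from the grammar; the class and method names are hypothetical, and it assumes the EA IFF 85 conventions of big-endian sizes and a pad byte after odd-length chunk data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists the chunk identifiers inside a FORM AIFF container. */
final class AiffChunkSketch {

    static List<String> localChunkIds(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);            // big-endian by default
        if (file.length < 12 || !tag(in).equals("FORM")) {
            throw new IllegalArgumentException("not an IFF FORM");
        }
        long formSize = in.getInt() & 0xFFFFFFFFL;        // counts the bytes after this field
        if (!tag(in).equals("AIFF")) {
            throw new IllegalArgumentException("FORM type is not AIFF");
        }
        long end = Math.min(file.length, 8 + formSize);
        List<String> ids = new ArrayList<>();
        while (in.position() + 8 <= end) {
            String id = tag(in);
            long ckSize = in.getInt() & 0xFFFFFFFFL;
            ids.add(id);
            long next = in.position() + ckSize + (ckSize & 1);  // skip data and pad byte
            if (next > end) {
                break;                                    // truncated chunk: stop the walk
            }
            in.position((int) next);
        }
        return ids;
    }

    /** Reads a four-character chunk identifier at the current position. */
    private static String tag(ByteBuffer in) {
        byte[] t = new byte[4];
        in.get(t);
        return new String(t, StandardCharsets.US_ASCII);
    }
}
```

A validator would go further and check that the required chunks are present and that each chunk's fields match its production.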
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced in 2010 by Google, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP
achieves, on average, 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
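The layout above translates directly into a fixed-offset check. The following is a minimal sketch in Java, not a full WebP validator; the class and method names are hypothetical. It assumes that the two size fields are little-endian, as in other RIFF files, and that the tag at offset 12 is "VP8" followed by a space to fill four bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustrative only: checks the fixed-offset header fields of a WebP file. */
final class WebPHeaderSketch {

    /** Reads the 4-byte tag at the given absolute offset. */
    static String tag(ByteBuffer in, int offset) {
        byte[] t = new byte[4];
        for (int i = 0; i < 4; i++) {
            t[i] = in.get(offset + i);
        }
        return new String(t, StandardCharsets.US_ASCII);
    }

    /** Returns the size of the VP8 image data, or -1 if the header does not match the layout. */
    static long vp8Size(byte[] file) {
        if (file.length < 20) {
            return -1;
        }
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (!tag(in, 0).equals("RIFF") || !tag(in, 8).equals("WEBP") || !tag(in, 12).equals("VP8 ")) {
            return -1;
        }
        long riffSize = in.getInt(4) & 0xFFFFFFFFL;     // bytes from offset 8 onward
        long vp8Size = in.getInt(16) & 0xFFFFFFFFL;     // bytes from offset 20 onward
        // "WEBP" + "VP8 " + size field occupy 12 bytes of the RIFF data before the image data.
        if (vp8Size + 12 > riffSize) {
            return -1;
        }
        return vp8Size;
    }
}
```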
A5 JPEG
A51 JPEGJFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker; see below
- any number of "segments" (similar to IFF chunks); see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last.
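The segment format above is enough to step through the segments at the start of a file. The following is a minimal sketch in Java; the class and method names are hypothetical. It stops at the first start-of-scan segment (type $da), because the entropy-coded image data that follows it does not use the segment layout, and it does not handle markers that carry no size field.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists segment types between SOI and the first SOS or EOI. */
final class JpegSegmentSketch {

    static List<Integer> segmentTypes(byte[] f) {
        if (f.length < 2 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= f.length) {
            if ((f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                break;                                   // EOI trailer
            }
            types.add(type);
            if (type == 0xDA) {
                break;                                   // SOS: compressed data follows
            }
            // Size is high byte first and includes its own two bytes.
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("segment size below 2 at offset " + pos);
            }
            pos += 2 + size;                             // skip $ff, type byte, and the segment
        }
        return types;
    }
}
```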
A52 JPEGExif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format - 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro FLC
[Crazy Talk]
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 Interactive Fiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SNM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data. Onda uses it
as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
// The cksize is the value of a long integer
ckSize returns [long value]
    : longValue
      { $value = $longValue.value; }
    ;
// defines the longValue data type in terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
    : h1 = BYTE h2 = BYTE l1 = BYTE l2 = BYTE
      { $value = (long)
            ( getUByteValue($h1) << 24 |
              getUByteValue($h2) << 16 |
              getUByteValue($l1) << 8  |
              getUByteValue($l2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the unsigned word data type
uWordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      { $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      { $value = (short)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the unsigned byte data type
uByteValue returns [int value]
    : BYTE
      // calls getUByteValue on the BYTE token
      { $value = (int) getUByteValue($BYTE); }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE
Figure 6. An ANTLR Grammar for the ILBM File Format
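The byte-combining actions in the longValue, uWordValue and wordValue rules can be illustrated outside the grammar. The following is a minimal Java sketch, not the authors' code: the class name BigEndianFields and its method names are hypothetical. It assumes each argument is an unsigned byte value in the range 0 to 255, as getUByteValue returns. One difference from the longValue rule is deliberate and is an assumption about intent: the first byte is widened to long before shifting, so that a value with the top bit set stays positive. In the grammar action the shift is performed on an int and cast to long afterwards, which yields a negative value in that case.

```java
final class BigEndianFields {
    // Combine four unsigned byte values (0..255), most significant first,
    // into an unsigned 32-bit quantity held in a long.
    static long uLong(int h1, int h2, int l1, int l2) {
        // Widening h1 to long before the shift keeps 0x80000000 and above positive.
        return ((long) h1 << 24) | (h2 << 16) | (l1 << 8) | l2;
    }

    // Unsigned 16-bit word, most significant byte first.
    static int uWord(int b1, int b2) {
        return (b1 << 8) | b2;
    }

    // Signed 16-bit word: the cast to short sign-extends the 16-bit pattern.
    static int word(int b1, int b2) {
        return (short) ((b1 << 8) | b2);
    }

    public static void main(String[] args) {
        System.out.println(uLong(0x00, 0x00, 0x09, 0xFC)); // size field of a FORM chunk
        System.out.println(word(0xFF, 0xFE));
    }
}
```

The bytes 00 00 09 FC decode to 2556, which is the FORM size reported for the sample file in Figure 8.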
6.2.2 The Lexer and Parser
When provided with the file described in the previous section, ANTLR generates the lexer and
parser Java programs (named iblmLexer and iblmParser in Figure 7). The Java application shown
in Figure 7 uses these programs to validate the ILBM file BLK.IFF. It produces the output shown
in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // ISO-8859-1 maps every input byte to exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7. A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8. Output of the ILBM Parser Applied to the File BLK.IFF
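The sizes reported in Figure 8 are mutually consistent, and the check is simple arithmetic. In an IFF FORM, the size field counts the 4-byte form type plus, for each contained chunk, an 8-byte header (ID and size), the chunk data, and one pad byte when the data length is odd. The following Java sketch illustrates that check; the class and method names are hypothetical and it is not part of the generated parser.

```java
final class IffSizeCheck {
    // Expected FORM size given the data sizes of the chunks it contains.
    static long formSize(long... chunkSizes) {
        long total = 4; // the form type, e.g. "ILBM"
        for (long size : chunkSizes) {
            // 8-byte chunk header + data + one pad byte if the data length is odd
            total += 8 + size + (size & 1);
        }
        return total;
    }

    public static void main(String[] args) {
        // BMHD, CMAP, CAMG and BODY sizes taken from Figure 8
        System.out.println(formSize(20, 48, 4, 2448));
    }
}
```

For the chunk sizes in Figure 8 this gives 4 + 28 + 56 + 12 + 2456 = 2556, which matches the size reported for the ILBM FORM. The CMAP size is likewise consistent with the listing: 16 colors of 3 bytes each occupy 48 bytes.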
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;  // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}
// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}
// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;
formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue        // Format category
      channels = wordValue         // Number of channels
      samplesPerSec = dWordValue   // Sampling rate
      avgBytesPerSec = dWordValue  // For buffer estimation
      blockAlign = wordValue       // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue    // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ( { size >= 0 }?=> BYTE { size--; } )+
    ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (long)
                ( getUByteValue($b1)       |
                  getUByteValue($b2) << 8  |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        } else {
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8  |
                  getUByteValue($b4) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the Lexer
BYTE
Figure 9. ANTLR Grammar for the RIFF WAVEFORM PCM File Format
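The grammar in Figure 9 recognizes the fields of the format chunk but does not relate their values to one another. For PCM data the RIFF WAVE specification ties them together: the block alignment equals the number of channels times the bytes per sample, and the average bytes per second equals the sampling rate times the block alignment. The Java sketch below shows the little-endian decoding used by dWordValue and wordValue and one way such a condition might be checked. It is an illustration under stated assumptions: the class WavePcmCheck and its methods are hypothetical, and the consistency condition is not part of the grammar in Figure 9.

```java
final class WavePcmCheck {
    // Little-endian unsigned 32-bit value from four unsigned byte values (0..255).
    static long dWordLE(int b1, int b2, int b3, int b4) {
        // b4 is widened to long before shifting so the result is never negative
        return b1 | (b2 << 8) | (b3 << 16) | ((long) b4 << 24);
    }

    // Little-endian unsigned 16-bit value.
    static int wordLE(int b1, int b2) {
        return b1 | (b2 << 8);
    }

    // True when the PCM format fields agree with one another.
    static boolean pcmFieldsConsistent(int channels, long samplesPerSec,
                                       long avgBytesPerSec, int blockAlign,
                                       int bitsPerSample) {
        int bytesPerSample = (bitsPerSample + 7) / 8; // round up to whole bytes
        return blockAlign == channels * bytesPerSample
            && avgBytesPerSec == samplesPerSec * blockAlign;
    }

    public static void main(String[] args) {
        long rate = dWordLE(0x44, 0xAC, 0x00, 0x00); // bytes as stored in the file
        System.out.println(rate);
        System.out.println(pcmFieldsConsistent(2, rate, 176400, 4, 16));
    }
}
```

In an attribute grammar, a condition of this kind would be attached to the formatSubChunk production as a predicate over the attributes of its fields.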
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in
2011, and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been constructed for the
chunk-based file formats AIFF, JPEG, and WAVE, as well as the directory-based format TIFF.
However, the research reported herein differs from that of the JHOVE project in that what is
sought is a technology for generating validators for binary file formats from grammars
specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification
of binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, as was the work of
the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars, based on binary file (array) grammars, are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with a pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
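A signature test for a chunk-based format usually needs only the first few bytes of the file. For an IFF file, the group ID "FORM" occupies bytes 0 to 3, the size occupies bytes 4 to 7, and the form type (for example "ILBM") occupies bytes 8 to 11. The Java sketch below illustrates such a test. It is not the magic file of [Underwood 2009]; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class IffMagicTest {
    // True when the leading bytes look like an IFF FORM of the given 4-character type.
    static boolean isIffForm(byte[] header, String formType) {
        if (header.length < 12 || formType.length() != 4) {
            return false;
        }
        // ISO-8859-1 maps each byte to one character, so no byte is lost in decoding
        String groupId = new String(header, 0, 4, StandardCharsets.ISO_8859_1);
        String type = new String(header, 8, 4, StandardCharsets.ISO_8859_1);
        return groupId.equals("FORM") && type.equals(formType);
    }

    public static void main(String[] args) {
        byte[] header = {'F', 'O', 'R', 'M', 0, 0, 9, (byte) 0xFC, 'I', 'L', 'B', 'M'};
        System.out.println(isIffForm(header, "ILBM"));
    }
}
```

A test of this kind only identifies the candidate format. Validation of the remaining structure is the task of the parser generated from the grammar.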
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The parser generator that comes closest to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats, and we should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars – An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Languages Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Languages (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
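The sample count above is a geometric sum, so it equals (2^ctOctave - 1) times the number of samples in the highest octave. The following Java sketch computes the count octave by octave. The class and method names are hypothetical; only the field names ctOctave, oneShotHiSamples and repeatHiSamples come from the VHDR chunk. A validator could compare this value with the size of the BODY chunk when the samples are stored uncompressed at one byte each, which is an assumption about sCompression and not something the grammar above expresses.

```java
final class Svx8BodySize {
    // Total number of samples in the BODY chunk across all octaves.
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        long highest = oneShotHiSamples + repeatHiSamples; // samples in the highest octave
        long total = 0;
        for (int octave = 0; octave < ctOctave; octave++) {
            total += highest << octave; // each lower octave doubles the sample count
        }
        return total;
    }

    public static void main(String[] args) {
        // Three octaves with 100 one-shot and 28 repeat samples in the highest octave
        System.out.println(bodySamples(3, 100, 28));
    }
}
```

With three octaves and 128 samples in the highest octave, the octaves hold 128, 256 and 512 samples, for a total of 896, which is (2^3 - 1) times 128.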
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi rarr ltheaderchunkgt lttrackchunksgt
ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt
ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt
lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt
lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt
lttrack datagt rarr ltMTrk eventgt+
ltMTrk eventgt rarr ltdelta-timegt lteventgt
lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt
ltMIDI eventgt rarr ltMIDI channel messagegt
ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt
ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt
ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt
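The header-chunk production above can be illustrated with a small hand-written recognizer. This is a sketch only, in Java, with hypothetical class and method names; it assumes that multi-byte integers are stored high byte first, as the Standard MIDI File specification prescribes, and that the 4-byte length field precedes the three 16-bit header fields.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical recognizer for the <headerchunk> production; not the authors' parser.
class MidiHeaderCheck {
    /** Returns {format, ntrks, division}, or throws if the header chunk is malformed. */
    static int[] parseHeaderChunk(byte[] data) {
        if (data.length < 14) {
            throw new IllegalArgumentException("too short for a header chunk");
        }
        String tag = new String(data, 0, 4, StandardCharsets.US_ASCII);
        if (!tag.equals("MThd")) {
            throw new IllegalArgumentException("expected MThd, found " + tag);
        }
        ByteBuffer buf = ByteBuffer.wrap(data);        // ByteBuffer is big-endian by default
        long length = buf.getInt(4) & 0xFFFFFFFFL;     // <length of header data>
        if (length != 6) {
            throw new IllegalArgumentException("header data length must be 6, found " + length);
        }
        int format = buf.getShort(8) & 0xFFFF;
        int ntrks = buf.getShort(10) & 0xFFFF;
        int division = buf.getShort(12) & 0xFFFF;
        return new int[] { format, ntrks, division };
    }
}
```

The length test plays the role of an attribute condition on the production: the chunk is accepted only when the size attribute equals 6.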
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on Electronic Arts' Interchange File Format. The file format is specified using C data structure definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File Format.
AudioAIFFfile -> “FORM” cksize “AIFF” localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> “COMM” cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> “SSND” cksize offset blockSize soundData[]
MarkerChunk -> “MARK” cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> “INST” cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> “MIDI” cksize MIDIData[ ]
AudioRecordingChunk -> “AESD” cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> “APPL” cksize applicationSignature data[ ]
CommentsChunk -> “COMT” cksize numcomments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is optional.
nameprop -> “NAME” size CHAR[size]
copyrightck -> “(c) ” size CHAR[size]
authorck -> “AUTH” size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> “ANNO” size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also based on Electronic Arts' Interchange File Format. Microsoft container formats such as Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992] [Randelshofer 2011]. The WebP picture format, introduced in 2010 by Google, also uses RIFF as a container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft 1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is adjustable, so a user can choose the trade-off between file size and image quality. WebP typically achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
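The offset table above translates directly into a recognizer for the container header. The Java sketch below is illustrative only: the class and method names are hypothetical, it covers just the layout in the table, and it assumes the two size fields are 32-bit little-endian integers, as is usual for RIFF.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical check of the WebP container layout listed above; not a VP8 decoder.
class WebPContainerCheck {
    private static String tagAt(byte[] d, int offset) {
        return new String(d, offset, 4, StandardCharsets.US_ASCII);
    }

    static boolean looksLikeSimpleWebP(byte[] d) {
        if (d.length < 20) {
            return false;                              // header fields occupy offsets 0..19
        }
        ByteBuffer buf = ByteBuffer.wrap(d).order(ByteOrder.LITTLE_ENDIAN);
        long riffSize = buf.getInt(4) & 0xFFFFFFFFL;   // size of data starting at offset 8
        long vp8Size = buf.getInt(16) & 0xFFFFFFFFL;   // size of data starting at offset 20
        return tagAt(d, 0).equals("RIFF")
            && tagAt(d, 8).equals("WEBP")
            && tagAt(d, 12).equals("VP8 ")
            && riffSize + 8 <= d.length                // declared sizes must fit in the file
            && vp8Size + 20 <= d.length;
    }
}
```

The two size comparisons correspond to the size attributes that an attribute grammar would attach to the RIFF chunk and to the nested VP8 chunk.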
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff      identifies segment
    n        type of segment (one byte)
    sh, sl   size of the segment, including these two bytes but not
             including the $ff and the type byte. Note: not Intel order;
             high byte first, low byte last
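The segment layout above is enough to walk the segments of a file without decoding any image data. The Java sketch below is one way to do so; the class and method names are hypothetical, and stopping at the start-of-scan segment (type $da) is an assumption of the sketch, made because entropy-coded data rather than another segment header follows that segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical segment walker for the JPEG/JFIF layout described above.
class JpegSegmentWalker {
    /** Returns the type bytes of the segments that follow SOI, in file order. */
    static List<Integer> segmentTypes(byte[] d) {
        if (d.length < 4 || (d[0] & 0xFF) != 0xFF || (d[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI marker");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= d.length) {
            if ((d[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = d[pos + 1] & 0xFF;
            if (type == 0xD9) {
                break;                                 // EOI: trailer reached
            }
            // sh, sl: high byte first; the size counts itself but not $ff and the type byte
            int size = ((d[pos + 2] & 0xFF) << 8) | (d[pos + 3] & 0xFF);
            if (size < 2 || pos + 2 + size > d.length) {
                throw new IllegalArgumentException("bad segment size at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                break;                                 // assumption: stop at start of scan
            }
            pos += 2 + size;
        }
        return types;
    }
}
```

For a file consisting of SOI, an APP0 segment ($e0), a quantization-table segment ($db), and EOI, the method returns the two type bytes $e0 and $db.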
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
“Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC 10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a marker and associated parameters, called marker parameters. In every marker segment the first two bytes after the marker shall be an unsigned big endian integer value that denotes the length in bytes of the marker parameters (including two bytes of this length parameter but not the two bytes of the marker itself).” [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format. The most common file formats contained within an ASF file are Windows Media Audio (WMA) and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
$value = (int) getUByteValue($BYTE);
catch [RecognitionException re] {
    reportError(re);
    recover(input, re);
}
catch [UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
}
// Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE
Figure 6 An ANTLR Grammar for the ILBM File Format
6.2.2 The Lexer and Parser
When provided the file described in the previous section, ANTLR generates the ilbmlexer and ilbmparser Java programs. The Java application shown in Figure 7 uses these programs to validate the ILBM file BLK.IFF. It produces the output shown in Figure 8.
package org.gtri.grammars.chunk;

import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that every byte maps to exactly one character.
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE " + filePath);
        // The start rule returns true when the whole file conforms to the grammar.
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
FILE C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR. The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;               // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}

// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue          // Format category
      channels = wordValue           // Number of channels
      samplesPerSec = dWordValue     // Sampling rate
      avgBytesPerSec = dWordValue    // For buffer estimation
      blockAlign = wordValue         // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue      // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

// Consumes BYTE terminals while the remaining count is positive, so that
// exactly size bytes are matched
waveData [long size]
    : ( { size > 0 }?=> BYTE { size--; } )+
    ;

// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 );
        } else {
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the Lexer
BYTE
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
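The dWordValue rule in Figure 9 assembles four bytes with int arithmetic and widens the result to long afterwards. In Java a left shift by 24 of an int whose value is 0x80 or greater sets the sign bit, so a chunk size of 2 GB or more would come out negative. The sketch below is a suggested variant, not the authors' code, with hypothetical names; it widens each byte to long before shifting so the result is always the unsigned 32-bit value.

```java
// Hypothetical helpers illustrating little-endian ("II") assembly of RIFF integers.
class ByteOrderSketch {
    /** Combines four byte values (low byte first) into an unsigned 32-bit value. */
    static long dWordLittleEndian(int b1, int b2, int b3, int b4) {
        return ((long) (b1 & 0xFF))
             | ((long) (b2 & 0xFF) << 8)
             | ((long) (b3 & 0xFF) << 16)
             | ((long) (b4 & 0xFF) << 24);   // widened first, so no sign extension
    }

    /** Combines two byte values (low byte first) into an unsigned 16-bit value. */
    static int wordLittleEndian(int b1, int b2) {
        return (b1 & 0xFF) | ((b2 & 0xFF) << 8);
    }

    public static void main(String[] args) {
        // Bytes 44 AC 00 00 encode the sampling rate 44100.
        System.out.println(dWordLittleEndian(0x44, 0xAC, 0x00, 0x00));
    }
}
```

For the file sizes in the examples of this report the two approaches agree; the difference matters only when the high byte of a size field is 0x80 or above.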
7 Related Research
Researchers have developed a number of data description languages to support validation of file formats. EAST (Enhanced Ada Subset) is a data description language developed by the Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic information. The EAST description is used to interpret and provide access to information in binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to manipulate Java jar files and ELF object files. The PADS/C data description language can be used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C compiler compiles the description into tools that can be used to recognize, manipulate, and transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe non-XML physical representations. Version 1 of the language specification was published in 2011, and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file structures of binary files. However, they are based on file structure declarations such as those of C or Java. The binary file format grammar described in this paper most closely resembles DFDL. However, the binary file grammar described in this paper is the only data description language based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to provide automated and efficient identification and validation of the formats of digital files. JHOVE is a format-specific digital object validation API written in Java. JHOVE supports validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation of binary file formats. As a matter of fact, binary file grammars and parsers have been constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the directory-based format TIFF. However, the research reported herein differs from that of the JHOVE project in that what is sought is a technology for generating validators for binary file formats from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the context-free grammars used to specify the syntax of programming languages to the specification of binary file formats, and to use these grammars with parsers for validating the file formats of binary files. A major family of binary file formats, the chunk-based formats, was described. Then extensions to the concepts of context-free grammars and attribute grammars were described that enable the specification of binary file formats. Examples of attribute array grammars for two chunk-based binary file formats were then presented. Recursive descent parsers for this class of grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally, related research in data description languages for binary file formats was described, as was the work of the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent parsers for validating the file formats of chunk-based binary files. It remains to be determined whether these attribute grammars based on binary file (array) grammars are adequate to define other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a parser generator is needed that creates, from a binary file grammar, a recursive descent parser (with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already known. In this research we use a file type identifier based on the UNIX file command and a magic file that we have created [Underwood 2009]. We have samples of the chunk-based file formats discussed in this report, including those referenced in Appendix A. We also have file signature tests (magic tests) for all these file types.
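To make the notion of a file signature test concrete, the following Java sketch identifies a few of the formats from Appendix A by their leading bytes. It is only an illustration with hypothetical names; it is not the authors' magic file and does not reproduce the behavior of the UNIX file command. The signatures used are the FORM and RIFF container tags followed by a form type at offset 8, the MThd tag, the JPEG SOI marker, and the 8-byte PNG signature.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical signature tests for a handful of chunk-based formats.
class MagicSketch {
    private static boolean startsWith(byte[] head, int... magic) {
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((head[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a short format label for the leading bytes of a file, or "unknown". */
    static String identify(byte[] head) {
        boolean iff = startsWith(head, 'F', 'O', 'R', 'M');
        boolean riff = startsWith(head, 'R', 'I', 'F', 'F');
        if ((iff || riff) && head.length >= 12) {
            // The form type (for example ILBM, 8SVX, AIFF, WAVE, WEBP) sits at offset 8.
            String formType = new String(head, 8, 4, StandardCharsets.US_ASCII);
            return (iff ? "IFF" : "RIFF") + "/" + formType;
        }
        if (startsWith(head, 'M', 'T', 'h', 'd')) {
            return "MIDI";
        }
        if (startsWith(head, 0xFF, 0xD8)) {
            return "JPEG";
        }
        if (startsWith(head, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "PNG";
        }
        return "unknown";
    }
}
```

A label returned by such a test would select which generated parser to run for validation.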
We would like to have a parser generator (also called a compiler-compiler) that would take as input an attribute grammar for a binary file format and generate a parser or interpreter that recognizes the file format, or that interprets the file format and displays or plays the contents. The closest parser generator that we have found to having these capabilities is ANTLR (www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of “chunk-based” file formats. We have specifications for each of these file formats and samples of files with these formats.
These file formats have a similar structure. We have developed binary file grammars for two of these binary file formats. We should be able to do so for all of them. We used these grammars with the extensions to ANTLR to generate top-down recursive descent parsers for the file formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified with binary file grammars, then only one parser generator is needed to generate the parsers for verifying that a file conforms to a particular format. Similarly, the same parser generator can be used for conversion of legacy file formats to current or standard formats. Finally, the same parser generator can be used for generating viewers/players for most file formats. This would increase the likelihood of preserving and making available into the indefinite future those digital records encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D. Baum & G. Noel. The Sims Interchange File Format. 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspect.org. Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library. JHOVE2 User's Guide. 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari. Caligari trueSpace file format. Document Version 6, July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher, K. & Gruber, R. (2005). PADS: A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft. Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S. Issar and W. Ward. CMU's robust spoken language understanding
system. EUROSPEECH-93, pp. 2147-2150. wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D. E. Knuth. Semantics of Context-Free Grammars. Mathematical Systems Theory,
vol. 2, pp. 127-145, 1968
[Knuth 1971] D. E. Knuth. Semantics of Context-Free Grammars (corrections). Mathematical
Systems Theory, vol. 5, pp. 95-96, 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave. Lightwave 3D Object Files (LWO2). November 9, 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch. CINEMA 4D® - File Format Description. MAXON Computer,
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein. PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison. “EA IFF 85” Standard for Interchange Format Files.
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV. April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward and S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 41 of this report
A12 8-bit Sampled Voice (8SVX)
The specification for the 8-bit sampled voice file format uses C programming language data
structures to define the format [Hayes and Morrison 1985]. The following is a re-expression
of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + 2^{1} + \cdots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
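As a worked check of the BODY size rule, the following Java sketch sums the per-octave sample counts and compares the result with the closed form (2^ctOctave - 1) × (oneShotHiSamples + repeatHiSamples). The class and method names are hypothetical and are not part of the 8SVX specification; the upper bound on ctOctave is an assumption made only to keep the arithmetic within a Java long.

```java
/** Sketch: expected number of BODY samples in an 8SVX voice (hypothetical helper). */
final class Svx8BodySize {

    /** Sums 2^k * (oneShotHiSamples + repeatHiSamples) for k = 0 .. ctOctave-1. */
    static long bodySamples(long oneShotHiSamples, long repeatHiSamples, int ctOctave) {
        // Assumption: limit ctOctave so that the shifts below cannot overflow a long.
        if (oneShotHiSamples < 0 || repeatHiSamples < 0 || ctOctave < 0 || ctOctave > 24) {
            throw new IllegalArgumentException("field value out of range for this sketch");
        }
        long highestOctave = oneShotHiSamples + repeatHiSamples;
        long total = 0;
        for (int k = 0; k < ctOctave; k++) {
            total += highestOctave << k; // each lower octave doubles the sample count
        }
        return total;
    }

    /** Closed form of the same geometric sum: (2^ctOctave - 1) * (oneShot + repeat). */
    static long bodySamplesClosedForm(long oneShotHiSamples, long repeatHiSamples, int ctOctave) {
        return ((1L << ctOctave) - 1) * (oneShotHiSamples + repeatHiSamples);
    }

    public static void main(String[] args) {
        // Three octaves with 32 one-shot and 64 repeat samples in the highest octave.
        System.out.println(bodySamples(32, 64, 3));
        System.out.println(bodySamplesClosedForm(32, 64, 3));
    }
}
```

A validator could compare this value with the cksize of the BODY chunk, under the further assumption that the sample data are stored uncompressed (sCompression = 0).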
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. The header data consist of three
16-bit integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
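The header chunk rule can be checked directly on the bytes of a file. The Java sketch below is illustrative only: the class and method names are hypothetical, and it relies on the fact (not stated in the grammar above) that Standard MIDI Files store multi-byte integers most significant byte first, which is also the default byte order of java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: recognizer for the MIDI header chunk (hypothetical helper). */
final class MidiHeaderCheck {

    /** Returns {format, ntrks, division}, or throws if the header chunk is malformed. */
    static int[] parseHeaderChunk(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("too short to hold a header chunk");
        }
        ByteBuffer buf = ByteBuffer.wrap(file); // big-endian by default
        byte[] id = new byte[4];
        buf.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("chunk type is not MThd");
        }
        int length = buf.getInt(); // <length of header data>
        if (length != 6) {
            throw new IllegalArgumentException("length of header data is not 6");
        }
        int format = buf.getShort() & 0xFFFF;   // three 16-bit integers follow
        int ntrks = buf.getShort() & 0xFFFF;
        int division = buf.getShort() & 0xFFFF;
        return new int[] {format, ntrks, division};
    }

    public static void main(String[] args) {
        byte[] header = {'M', 'T', 'h', 'd', 0, 0, 0, 6, 0, 1, 0, 2, 0, 96};
        int[] fields = parseHeaderChunk(header);
        System.out.println(fields[0] + " " + fields[1] + " " + fields[2]);
    }
}
```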
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote
                   lowVelocity highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numComments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks: their data portion consists solely of text. Each of these
chunks is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 42 of this report
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP is
reported to achieve an average of 39% more compression than JPEG and JPEG 2000 without loss of
image quality. A WebP file consists of VP8 image data and a container based on RIFF. The format
of a WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
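The layout above translates into a short header check. The following Java sketch is an illustration under stated assumptions rather than a complete WebP validator: the class and method names are hypothetical, the size fields are read as little-endian unsigned 32-bit integers (the RIFF convention), and the fourth character of the "VP8 " tag is taken to be a space.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch: checks the RIFF/WEBP/VP8 header layout (hypothetical helper). */
final class WebPHeaderCheck {

    /** Returns the size of the VP8 image data, or throws if the layout is violated. */
    static long vp8PayloadSize(byte[] file) {
        if (file.length < 20) {
            throw new IllegalArgumentException("too short for a WebP header");
        }
        if (!hasTag(file, 0, "RIFF") || !hasTag(file, 8, "WEBP") || !hasTag(file, 12, "VP8 ")) {
            throw new IllegalArgumentException("missing RIFF, WEBP, or VP8 tag");
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        long riffSize = buf.getInt(4) & 0xFFFFFFFFL;  // size of data starting at offset 8
        long vp8Size = buf.getInt(16) & 0xFFFFFFFFL;  // size of data starting at offset 20
        // The RIFF data must hold "WEBP" (4), "VP8 " (4), the size field (4), and the payload.
        if (riffSize < 12 + vp8Size || file.length < 20 + vp8Size) {
            throw new IllegalArgumentException("chunk sizes are inconsistent");
        }
        return vp8Size;
    }

    private static boolean hasTag(byte[] file, int offset, String tag) {
        return tag.equals(new String(file, offset, 4, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocate(22).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII)).putInt(14);
        b.put("WEBP".getBytes(StandardCharsets.US_ASCII));
        b.put("VP8 ".getBytes(StandardCharsets.US_ASCII)).putInt(2);
        b.put((byte) 0).put((byte) 0); // two placeholder payload bytes
        System.out.println(vp8PayloadSize(b.array()));
    }
}
```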
A5 JPEG
A51 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh sl   size of the segment, including these two bytes but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
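The segment format can be exercised with a small walker over the bytes of a file. The Java sketch below is a minimal illustration, not a JPEG validator: the class and method names are hypothetical, and it stops at the start-of-scan segment (type $da) on the assumption that the compressed image data following that segment are not organized as length-prefixed segments.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: lists the segment types found after SOI (hypothetical helper). */
final class JpegSegmentWalk {

    static List<Integer> segmentTypes(byte[] f) {
        if (f.length < 4 || (f[0] & 0xFF) != 0xFF || (f[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 4 <= f.length) {
            if ((f[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = f[pos + 1] & 0xFF;
            if (type == 0xD9) {
                break; // EOI trailer
            }
            // Size is stored high byte first and counts its own two bytes.
            int size = ((f[pos + 2] & 0xFF) << 8) | (f[pos + 3] & 0xFF);
            if (size < 2 || pos + 2 + size > f.length) {
                throw new IllegalArgumentException("bad segment size at offset " + pos);
            }
            types.add(type);
            if (type == 0xDA) {
                break; // assumption: image data follow and are not walked here
            }
            pos += 2 + size; // skip $ff, the type byte, and the rest of the segment
        }
        return types;
    }

    public static void main(String[] args) {
        byte[] tiny = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0, 4, 'A', 'B',
                       (byte) 0xFF, (byte) 0xDB, 0, 2, (byte) 0xFF, (byte) 0xD9};
        System.out.println(segmentTypes(tiny));
    }
}
```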
A52 JPEG/Exif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            CompressionMethod FilterMethod InterlaceMethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entries -> entry
entry -> red green blue
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format ndash 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro fLC
[Crazy Talk]
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 30 ndash X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 Interactive Fiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SANM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;

public class Test {
    // Path of the ILBM file to be parsed
    static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";

    public static void main(String[] args) throws RecognitionException, IOException {
        // Read the file as ISO-8859-1 so that every byte maps to exactly one character
        ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
        iblmLexer lexer = new iblmLexer(stream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        iblmParser parser = new iblmParser(tokenStream);
        System.out.println("FILE: " + filePath);
        System.out.println("Successfully parsed = " + parser.iblm());
    }
}
Figure 7 A Java Application for Parsing an ILBM File
FILE CUserssl170PicturesIFF-ilbmBLKIFF
ID ILBM Size 2556
ID BMHD Size 20
Width 200
Height 144
X Position 0
Y Position 0
Number of Panes 6
Masking0
Compression1
Reserved10
TransParent Color0
X Aspest3
Y Aspect3
Page Width200
Page Height144
ID CMAP Size 48
Color[0] 0 0 0
Color[1] 0 128 128
Color[2] 0 255 255
Color[3] 255 0 255
Color[4] 0 255 0
Color[5] 255 0 0
Color[6] 0 0 255
Color[7] 255 255 0
Color[8] 128 128 128
Color[9] 192 192 192
Color[10] 128 0 0
Color[11] 0 128 0
Color[12] 128 128 0
Color[13] 0 0 128
Color[14] 128 0 128
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
63 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;    // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value in the range 0-255
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default byteOrder to little endian
    String byteOrder = "II";
}

// Rules used to create the parser
// Start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC {$ckId.value.equals("RIFF")}?
      ckSize = dWordValue
      formType = fourCC {$formType.value.equals("WAVE")}?
      formatSubChunk
      dataSubChunk
      {$result = true;}
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE {$value = $text;}
    ;

formatSubChunk
    : subChunkId = fourCC {$subChunkId.value.equals("fmt ")}?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue         // Format category
      channels = wordValue          // Number of channels
      samplesPerSec = dWordValue    // Sampling rate
      avgBytesPerSec = dWordValue   // For buffer estimation
      blockAlign = wordValue        // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue     // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC {$subChunkId.value.equals("data")}?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ( {size >= 0}?=> BYTE {size--;} )+
    ;

// Defines the long (32-bit) data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        // The mask keeps the result unsigned when the most significant bit is set
        if (byteOrder.equals("II")) {
            $value = (long)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 |
                  getUByteValue($b3) << 16 |
                  getUByteValue($b4) << 24 ) & 0xFFFFFFFFL;
        } else {
            $value = (long)
                ( getUByteValue($b1) << 24 |
                  getUByteValue($b2) << 16 |
                  getUByteValue($b3) << 8 |
                  getUByteValue($b4) ) & 0xFFFFFFFFL;
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Defines the word (16-bit) data type in terms of BYTE terminals
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II")) {
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        } else {
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
        }
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the lexer: every input character is one BYTE token
BYTE : . ;
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
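The byte-order logic of the dWordValue and wordValue rules can be studied outside ANTLR. The plain-Java sketch below is a hypothetical helper, not part of the generated parser; it assembles the bytes in long arithmetic so that a 32-bit value whose most significant bit is set stays non-negative, which is a design choice of this sketch.

```java
/** Sketch: assembling 16- and 32-bit unsigned integers from bytes (hypothetical helper). */
final class ChunkIntegers {

    /** Combines four byte values into an unsigned 32-bit quantity held in a long. */
    static long dWord(int b1, int b2, int b3, int b4, boolean littleEndian) {
        if (littleEndian) {
            // First byte is least significant, as in RIFF ("II" in Figure 9)
            return (b1 & 0xFFL) | (b2 & 0xFFL) << 8 | (b3 & 0xFFL) << 16 | (b4 & 0xFFL) << 24;
        }
        // First byte is most significant, as in EA IFF 85
        return (b1 & 0xFFL) << 24 | (b2 & 0xFFL) << 16 | (b3 & 0xFFL) << 8 | (b4 & 0xFFL);
    }

    /** Combines two byte values into an unsigned 16-bit quantity. */
    static int word(int b1, int b2, boolean littleEndian) {
        return littleEndian
                ? (b1 & 0xFF) | (b2 & 0xFF) << 8
                : (b1 & 0xFF) << 8 | (b2 & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(dWord(0xFC, 0x09, 0x00, 0x00, true));  // little endian
        System.out.println(dWord(0x00, 0x00, 0x09, 0xFC, false)); // big endian
        System.out.println(dWord(0x00, 0x00, 0x00, 0x80, true));  // high bit set
    }
}
```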
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those
of C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects:
validation of binary file formats. In fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the
JHOVE project in that what is sought is a technology for generating validators for binary file
formats from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification
of binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars
were then described. Experience in using ANTLR, a parser generator for LL(*) string grammars,
in generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described, as was the JHOVE
project's work in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and
does not have the requisite data types built in.
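To make the requirement concrete, the sketch below shows the kind of lexer-free recursive descent recognizer that such a generator might emit for the WAVEFORM PCM grammar of Figure 9. It is written by hand as an illustration and is not the output of any existing tool; the class and method names are hypothetical. Each nonterminal becomes one method, and the data types (fourCC, dWord, word) are read directly from the byte array through java.nio.ByteBuffer. Unlike the grammar in Figure 9, this sketch also checks the chunk sizes against the bytes remaining, which is an added assumption about what a validator should enforce.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch: hand-written recursive descent recognizer for RIFF WAVE PCM (hypothetical). */
final class WavePcmRecognizer {
    private final ByteBuffer in;

    private WavePcmRecognizer(byte[] file) {
        this.in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Returns true if the bytes match "RIFF" size "WAVE" formatSubChunk dataSubChunk. */
    static boolean recognize(byte[] file) {
        try {
            return new WavePcmRecognizer(file).waveformPcm();
        } catch (BufferUnderflowException e) {
            return false; // input ended in the middle of a field
        }
    }

    private boolean waveformPcm() {
        if (!fourCC("RIFF")) return false;
        long ckSize = dWord();
        if (ckSize > in.remaining()) return false;
        return fourCC("WAVE") && formatSubChunk() && dataSubChunk();
    }

    private boolean formatSubChunk() {
        if (!fourCC("fmt ")) return false;
        long ckSize = dWord();
        if (ckSize < 16 || ckSize > in.remaining()) return false;
        word(); word(); dWord(); dWord(); word(); // commonFields
        word();                                   // pcmSpecificField: bitsPerSample
        return skip(ckSize - 16);
    }

    private boolean dataSubChunk() {
        if (!fourCC("data")) return false;
        return skip(dWord()); // waveData[ckSize]
    }

    private boolean skip(long n) {
        if (n > in.remaining()) return false;
        in.position(in.position() + (int) n);
        return true;
    }

    private boolean fourCC(String expected) {
        byte[] id = new byte[4];
        in.get(id);
        return expected.equals(new String(id, StandardCharsets.US_ASCII));
    }

    private long dWord() { return in.getInt() & 0xFFFFFFFFL; }

    private int word() { return in.getShort() & 0xFFFF; }
}
```

A caller would read a file into a byte array and pass it to WavePcmRecognizer.recognize; no token stream or character encoding is involved.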
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The parser generator we have found that comes closest to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with
these formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats, and we should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISO/IEC JTC 1/SC 29/WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISO/IEC 15948:2003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 pp 2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology "Appendix H LXO file format" Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward and S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1}\,(\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1})\,(\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
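The sample count above can be checked with a few lines of arithmetic. The following is a minimal sketch in Java; the class and method names are hypothetical and are not taken from the 8SVX specification or from the parsers described in this report.

```java
// Hypothetical helper for illustration only; not part of the 8SVX specification.
final class Svx8BodySize {

    /** Samples in one octave; index 0 is the highest-frequency octave (stored first). */
    static long samplesInOctave(int octaveIndex, long oneShotHiSamples, long repeatHiSamples) {
        // Each successive octave holds twice as many samples as the one before it.
        return (1L << octaveIndex) * (oneShotHiSamples + repeatHiSamples);
    }

    /** Total samples in BODY: (2^0 + ... + 2^(ctOctave-1)) * (oneShot + repeat). */
    static long bodySamples(int ctOctave, long oneShotHiSamples, long repeatHiSamples) {
        long total = 0;
        for (int k = 0; k < ctOctave; k++) {
            total += samplesInOctave(k, oneShotHiSamples, repeatHiSamples);
        }
        return total;
    }
}
```

Because the geometric sum equals 2^ctOctave - 1, a validator could compare the BODY chunk size against (2^ctOctave - 1) times (oneShotHiSamples + repeatHiSamples), assuming uncompressed one-byte samples.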
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
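As an illustration of the <headerchunk> production, the sketch below recognizes the header chunk by hand in Java. It assumes the big-endian byte order used by Standard MIDI Files; the class name MidiHeader and its field names simply mirror the grammar symbols and are otherwise hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration; the names mirror the grammar symbols above.
final class MidiHeader {
    final int format;
    final int ntrks;
    final int division;

    private MidiHeader(int format, int ntrks, int division) {
        this.format = format;
        this.ntrks = ntrks;
        this.division = division;
    }

    /** Recognizes <headerchunk> at the start of the file and returns its three fields. */
    static MidiHeader parse(byte[] file) {
        if (file.length < 14) {
            throw new IllegalArgumentException("too short to hold a header chunk");
        }
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN);
        byte[] id = new byte[4];
        in.get(id);
        if (!"MThd".equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MThd chunk type");
        }
        long length = in.getInt() & 0xFFFFFFFFL; // <length of header data>, read as unsigned
        if (length != 6) {
            throw new IllegalArgumentException("length of header data must be 6");
        }
        int format = in.getShort() & 0xFFFF;     // three 16-bit integers follow
        int ntrks = in.getShort() & 0xFFFF;
        int division = in.getShort() & 0xFFFF;
        return new MidiHeader(format, ntrks, division);
    }
}
```

The length check plays the role of an attribute condition attached to the production: the syntax alone accepts any 32-bit length, and the condition restricts it to 6.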
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
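Every chunk in the AIFF grammar has the same outer shape: a four-character ID, a size, and that many bytes of data. The sketch below, in Java, walks the local chunks of a FORM and lists their IDs. It assumes the EA IFF 85 conventions of big-endian sizes and a pad byte after odd-sized chunk data; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the parser described in this report.
final class IffChunkLister {

    private static String readId(ByteBuffer in) {
        byte[] id = new byte[4];
        in.get(id);
        return new String(id, StandardCharsets.US_ASCII);
    }

    /** Returns the IDs of the local chunks inside a FORM of the expected form type. */
    static List<String> localChunkIds(byte[] file, String formType) {
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN);
        if (in.remaining() < 12 || !"FORM".equals(readId(in))) {
            throw new IllegalArgumentException("not an IFF FORM");
        }
        long formSize = in.getInt() & 0xFFFFFFFFL;
        if (formSize < 4 || formSize > in.remaining()) {
            throw new IllegalArgumentException("FORM size does not fit the file");
        }
        int end = in.position() + (int) formSize;
        if (!formType.equals(readId(in))) {
            throw new IllegalArgumentException("unexpected form type");
        }
        List<String> ids = new ArrayList<>();
        while (in.position() + 8 <= end) {
            String id = readId(in);
            long size = in.getInt() & 0xFFFFFFFFL;
            if (in.position() + size > end) {
                throw new IllegalArgumentException("chunk " + id + " overruns the FORM");
            }
            ids.add(id);
            long next = in.position() + size + (size & 1); // skip data plus pad byte
            in.position((int) Math.min(next, end));
        }
        return ids;
    }
}
```

A grammar-based validator would go further and check the fields inside each chunk; this sketch only confirms that the chunk boundaries are consistent with the enclosing FORM size.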
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
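The fixed offsets in the table above lend themselves to a direct check. The following Java sketch tests the three tags and the two size fields; it assumes RIFF's little-endian size encoding, and the class and method names are hypothetical rather than part of any WebP library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical checker for the simple layout tabulated above.
final class WebPHeaderCheck {

    private static String tagAt(byte[] file, int offset) {
        return new String(file, offset, 4, StandardCharsets.US_ASCII);
    }

    /** True if the tags match and both size fields fit within the file. */
    static boolean looksLikeSimpleWebP(byte[] file) {
        if (file.length < 20) {
            return false; // too short to hold the fields at offsets 0 through 19
        }
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        long riffSize = in.getInt(4) & 0xFFFFFFFFL;  // data starting at offset 8
        long vp8Size = in.getInt(16) & 0xFFFFFFFFL;  // data starting at offset 20
        return tagAt(file, 0).equals("RIFF")
                && tagAt(file, 8).equals("WEBP")
                && tagAt(file, 12).equals("VP8 ")
                && riffSize <= file.length - 8L
                && vp8Size <= file.length - 20L;
    }
}
```

This only checks the container; it says nothing about whether the VP8 payload itself is well formed.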
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last.
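The segment layout described above can be traversed with a short loop. The Java sketch below lists the type bytes of the segments that follow SOI. It is a simplification under stated assumptions: it stops at the start-of-scan marker ($da), after which entropy-coded data follows, and it does not handle markers that carry no size field. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; a simplified walk over JPEG segments.
final class JpegSegments {

    /** Lists the segment type bytes after SOI, stopping at SOS ($da) or EOI ($d9). */
    static List<Integer> markerTypes(byte[] file) {
        if (file.length < 4 || (file[0] & 0xFF) != 0xFF || (file[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI header");
        }
        List<Integer> types = new ArrayList<>();
        int pos = 2;
        while (pos + 2 <= file.length) {
            if ((file[pos] & 0xFF) != 0xFF) {
                throw new IllegalArgumentException("expected $ff at offset " + pos);
            }
            int type = file[pos + 1] & 0xFF;
            types.add(type);
            if (type == 0xD9 || type == 0xDA) {
                break; // EOI trailer, or start of scan data
            }
            if (pos + 4 > file.length) {
                throw new IllegalArgumentException("truncated segment header");
            }
            // sh, sl: high byte first; the size counts itself but not $ff and the type byte
            int size = ((file[pos + 2] & 0xFF) << 8) | (file[pos + 3] & 0xFF);
            if (size < 2) {
                throw new IllegalArgumentException("segment size below 2 at offset " + pos);
            }
            pos += 2 + size;
        }
        return types;
    }
}
```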
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
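To show how the headerck production maps onto bytes, the Java sketch below checks the PNG signature and reads the first fields of IHDR. It follows the on-disk layout of the PNG specification, in which each chunk stores its 4-byte big-endian length before the 4-byte chunk type and ends with a 4-byte CRC, details the abbreviated grammar above does not show. The CRC is not verified here, and the class and field names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical reader for illustration; it does not verify the chunk CRC.
final class PngHeader {
    static final byte[] SIGNATURE = {(byte) 137, 80, 78, 71, 13, 10, 26, 10};

    final long width;
    final long height;
    final int bitDepth;
    final int colourType;

    private PngHeader(long width, long height, int bitDepth, int colourType) {
        this.width = width;
        this.height = height;
        this.bitDepth = bitDepth;
        this.colourType = colourType;
    }

    static PngHeader parse(byte[] file) {
        // signature (8) + length and type (8) + IHDR data (13) + CRC (4)
        if (file.length < 33 || !Arrays.equals(Arrays.copyOf(file, 8), SIGNATURE)) {
            throw new IllegalArgumentException("missing PNG signature");
        }
        ByteBuffer in = ByteBuffer.wrap(file); // ByteBuffer is big-endian by default
        in.position(8);
        long length = in.getInt() & 0xFFFFFFFFL;
        byte[] type = new byte[4];
        in.get(type);
        if (length != 13 || !"IHDR".equals(new String(type, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("first chunk is not a 13-byte IHDR");
        }
        long width = in.getInt() & 0xFFFFFFFFL;
        long height = in.getInt() & 0xFFFFFFFFL;
        int bitDepth = in.get() & 0xFF;
        int colourType = in.get() & 0xFF;
        return new PngHeader(width, height, bitDepth, colourType);
    }
}
```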
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis's The Sims [Baum & Noel 2002].
Color[15] 255 255 255
ID CAMG Size 4
View Modes 0x800
ID BODY Size 2448
Number of bytes 2448
Successfully parsed = true
Figure 8 Output of the ILBM Parser Applied to the file BLKIFF
6.3 An ANTLR Grammar for the RIFF Waveform PCM File Format
Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR.
The output of ANTLR is a lexer and parser for WAVEFORM PCM files.
grammar waveform_pcm;

options {
    language = Java;              // The programming language of the output parser
    TokenLabelType = CommonToken;
}

// The text contained in @header is placed at the top of the parser file
@header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text contained in @lexer::header is placed at the top of the lexer file
@lexer::header {
    package gtri.org.grammars.chunk;
    import java.io.UnsupportedEncodingException;
}

// The text in the @members section is placed inside the parser class
@members {
    // getUByteValue inputs the token of a BYTE terminal and returns an
    // integer value
    public int getUByteValue(Token tok) throws UnsupportedEncodingException {
        int intValue = 0;
        char charValue = tok.getText().charAt(0);
        intValue = (int) charValue;
        intValue = intValue & 0x00ff;
        return intValue;
    }

    // Default ByteOrder to Little Endian
    String byteOrder = "II";
}

// Rules used to create the Parser
// start symbol: waveform_pcm
waveform_pcm returns [boolean result]
    : ckId = fourCC { $ckId.value.equals("RIFF") }?
      ckSize = dWordValue
      formType = fourCC { $formType.value.equals("WAVE") }?
      formatSubChunk
      dataSubChunk
      { $result = true; }
    ;

fourCC returns [String value]
    : BYTE BYTE BYTE BYTE { $value = $text; }
    ;

formatSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("fmt ") }?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;

commonFields
    : formatTag = wordValue          // Format category
      channels = wordValue           // Number of channels
      samplesPerSec = dWordValue     // Sampling rate
      avgBytesPerSec = dWordValue    // For buffer estimation
      blockAlign = wordValue         // Data block size
    ;

pcmSpecificField
    : bitsPerSample = wordValue      // Sample size
    ;

dataSubChunk
    : subChunkId = fourCC { $subChunkId.value.equals("data") }?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;

waveData [long size]
    : ( { size >= 0 }?=> BYTE { size--; } )+
    ;

// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
          if (byteOrder.equals("II"))
              $value = (long)
                  ( getUByteValue($b1) |
                    getUByteValue($b2) << 8 |
                    getUByteValue($b3) << 16 |
                    getUByteValue($b4) << 24 );
          else
              $value = (long)
                  ( getUByteValue($b1) << 24 |
                    getUByteValue($b2) << 16 |
                    getUByteValue($b3) << 8 |
                    getUByteValue($b4) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
          if (byteOrder.equals("II"))
              $value = (int)
                  ( getUByteValue($b1) |
                    getUByteValue($b2) << 8 );
          else
              $value = (int)
                  ( getUByteValue($b1) << 8 |
                    getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
        reportError(re);
        recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }

// Terminal rule used to create the Lexer
BYTE
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
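For comparison with the generated parser, the productions of Figure 9 can also be written directly as a small recursive descent recognizer with one method per nonterminal. The Java sketch below is a hand-written analogue under stated assumptions, not the code ANTLR produces: it reads bytes from a little-endian ByteBuffer instead of BYTE tokens, it skips the data bytes in one step instead of matching them one at a time, and the class name WavePcmRecognizer is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-written analogue of the grammar in Figure 9.
final class WavePcmRecognizer {
    private final ByteBuffer in;

    private WavePcmRecognizer(byte[] file) {
        this.in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Returns true if the bytes match the waveform_pcm start symbol. */
    static boolean recognize(byte[] file) {
        try {
            new WavePcmRecognizer(file).waveformPcm();
            return true;
        } catch (RuntimeException e) { // buffer underflow or a failed condition
            return false;
        }
    }

    private void waveformPcm() {
        expect("RIFF");
        dWordValue();            // ckSize
        expect("WAVE");
        formatSubChunk();
        dataSubChunk();
    }

    private void formatSubChunk() {
        expect("fmt ");
        dWordValue();            // ckSize
        wordValue();             // formatTag
        wordValue();             // channels
        dWordValue();            // samplesPerSec
        dWordValue();            // avgBytesPerSec
        wordValue();             // blockAlign
        wordValue();             // bitsPerSample (PCM-specific field)
    }

    private void dataSubChunk() {
        expect("data");
        long ckSize = dWordValue();
        if (ckSize > in.remaining()) {
            throw new IllegalStateException("data chunk is truncated");
        }
        in.position(in.position() + (int) ckSize); // skip waveData[ckSize]
    }

    private void expect(String fourCC) {
        byte[] id = new byte[4];
        in.get(id);
        if (!fourCC.equals(new String(id, StandardCharsets.US_ASCII))) {
            throw new IllegalStateException("expected " + fourCC);
        }
    }

    private long dWordValue() {
        return in.getInt() & 0xFFFFFFFFL; // mask keeps values above 2^31 - 1 non-negative
    }

    private int wordValue() {
        return in.getShort() & 0xFFFF;
    }
}
```

Reading the 32-bit size through a mask into a long is a design choice of this sketch: it keeps a chunk size whose top bit is set from being interpreted as a negative number.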
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is, to our knowledge, the only data
description language based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. As a matter of fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG and WAVE, as well as the
directory-based format TIFF. However, the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, and the JHOVE
project in creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
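The voice8header production above can be read as a fixed binary layout. The following is a minimal Python sketch, not the authors' implementation, that decodes the data portion of a VHDR chunk. It assumes big-endian byte order (as in EA IFF 85) and that Fixed is a signed 32-bit value with 16 fractional bits; the names VHDR_LAYOUT and parse_voice8header are hypothetical.

```python
import struct

# Assumed big-endian layout of the VHDR data: three ULONGs, one UWORD,
# two UBYTEs, and one 32-bit Fixed (16.16 fixed point) volume.
VHDR_LAYOUT = struct.Struct(">IIIHBBi")


def parse_voice8header(ckdata: bytes) -> dict:
    """Decode the data bytes of a VHDR chunk into named fields."""
    if len(ckdata) != VHDR_LAYOUT.size:
        raise ValueError(
            f"VHDR data must be {VHDR_LAYOUT.size} bytes, got {len(ckdata)}"
        )
    names = (
        "oneShotHiSamples", "repeatHiSamples", "samplesPerHiCycle",
        "samplesPerSec", "ctOctave", "sCompression", "volume",
    )
    fields = dict(zip(names, VHDR_LAYOUT.unpack(ckdata)))
    # Fixed: the raw value 0x10000 is taken to represent 1.0.
    fields["volume"] = fields["volume"] / 65536.0
    return fields
```

The sketch checks only the length of the data; a validator derived from the grammar would also compare the cksize attribute against this length.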
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$$
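As a worked check of the sample count, the short Python sketch below sums the per-octave sizes and compares the result with the closed form of the geometric series, (2^ctOctave - 1) × (oneShotHiSamples + repeatHiSamples). The function name body_sample_count is hypothetical and the code is illustrative only.

```python
def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """Total number of samples in a BODY chunk holding ct_octave octaves."""
    if ct_octave < 0:
        raise ValueError("ctOctave must not be negative")
    highest = one_shot_hi + repeat_hi
    # Octave k (k = 0 is the highest-frequency octave) holds 2**k * highest samples.
    total = sum((2 ** k) * highest for k in range(ct_octave))
    # The geometric series has the closed form (2**ct_octave - 1) * highest.
    assert total == (2 ** ct_octave - 1) * highest
    return total
```

For example, with three octaves and 100 + 28 samples in the highest octave, the octaves hold 128, 256, and 512 samples, for a total of 896.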
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
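The headerchunk production can be checked with a few lines of code. The following Python sketch is an illustration under stated assumptions, not part of the cited specification: it assumes big-endian integers, a 32-bit length field, and the hypothetical function name parse_midi_header.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Validate an MThd chunk and return its three 16-bit header fields."""
    if len(data) < 14:
        raise ValueError("too short to contain an MThd chunk")
    tag, length, fmt, ntrks, division = struct.unpack(">4sIHHH", data[:14])
    if tag != b"MThd":
        raise ValueError("missing MThd chunk type")
    if length != 6:
        raise ValueError(f"length of header data must be 6, got {length}")
    return {"format": fmt, "ntrks": ntrks, "division": division}
```

The track chunks that follow the header could be handled in the same way, by reading "MTrk" and a 32-bit length and then skipping or parsing that many bytes of track data.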
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFF-C (AIFC), of the Audio
Interchange File Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
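The grammar above leaves the data types of the CommonChunk fields implicit. In the AIFF specification [Apple 1989], numChannels and sampleSize are 16-bit integers, numSampleFrames is a 32-bit unsigned integer, and sampleRate is an 80-bit IEEE extended-precision number, which has no direct counterpart in most languages. The Python sketch below shows one way to decode these fields; the function names are hypothetical and the code is a sketch rather than a complete AIFF reader.

```python
import math
import struct


def decode_extended80(raw: bytes) -> float:
    """Convert a big-endian 80-bit extended-precision number to a float."""
    if len(raw) != 10:
        raise ValueError("an 80-bit extended value occupies 10 bytes")
    sign_exp, mantissa = struct.unpack(">HQ", raw)
    exponent = sign_exp & 0x7FFF
    if exponent == 0x7FFF:
        raise ValueError("infinity or NaN is not a usable sample rate")
    # The 64-bit mantissa carries an explicit integer bit, hence the extra 63.
    value = math.ldexp(mantissa, exponent - 16383 - 63)
    return -value if sign_exp & 0x8000 else value


def parse_common_chunk(ckdata: bytes) -> tuple:
    """Decode the 18 data bytes of a COMM chunk."""
    if len(ckdata) != 18:
        raise ValueError(f"COMM data must be 18 bytes, got {len(ckdata)}")
    num_channels, num_sample_frames, sample_size = struct.unpack(">hIh", ckdata[:8])
    sample_rate = decode_extended80(ckdata[8:18])
    return num_channels, num_sample_frames, sample_size, sample_rate
```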
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced in 2010 by Google, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
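The offset table above translates directly into a header check. The following Python sketch is illustrative only; it assumes the little-endian 32-bit sizes used by RIFF, and the function name check_webp_header is hypothetical.

```python
import struct


def check_webp_header(data: bytes) -> int:
    """Validate the 20-byte WebP header and return the VP8 data size."""
    if len(data) < 20:
        raise ValueError("too short to contain a WebP header")
    riff, riff_size, form, vp8, vp8_size = struct.unpack("<4sI4s4sI", data[:20])
    if riff != b"RIFF":
        raise ValueError("offset 0: expected the RIFF tag")
    if form != b"WEBP":
        raise ValueError("offset 8: expected the WEBP form type")
    if vp8 != b"VP8 ":
        raise ValueError("offset 12: expected the VP8 tag")
    if 8 + riff_size > len(data):
        raise ValueError("RIFF chunk size exceeds the file length")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 chunk size exceeds the file length")
    return vp8_size
```

The sketch stops at the container level; it does not inspect the VP8 image data itself.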
A5 JPEG
A51 JPEGJFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
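The segment format above is enough to walk the marker segments at the start of a JPEG file. The Python sketch below is a simplified illustration: it stops at the start-of-scan segment (type $da), because entropy-coded image data rather than further length-prefixed segments follows it, and it does not handle fill bytes or markers without a length field. The function name list_jpeg_segments is hypothetical.

```python
import struct


def list_jpeg_segments(data: bytes) -> list:
    """Return (segment type, segment size) pairs up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    segments = []
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type == 0xD9:  # EOI trailer
            break
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # High byte first: the size counts its own two bytes but not $ff and type.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError(f"bad segment size at offset {pos}")
        segments.append((seg_type, size))
        pos += 2 + size
        if seg_type == 0xDA:  # SOS: compressed image data follows
            break
    return segments
```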
A52 JPEGExif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entries -> entry
entry -> red green blue
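In a PNG file each chunk consists of a 4-byte length, a 4-byte chunk type, the chunk data, and a 4-byte CRC computed over the type and data. The Python sketch below checks the signature and the header chunk accordingly. It is an illustration under these assumptions, with the hypothetical function name parse_png_header, and it relies on zlib.crc32 from the Python standard library for the CRC.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def parse_png_header(data: bytes) -> dict:
    """Validate the PNG signature and IHDR chunk; return the IHDR fields."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    if len(data) < 8 + 4 + 4 + 13 + 4:
        raise ValueError("too short to contain an IHDR chunk")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk must be a 13-byte IHDR chunk")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    # The CRC covers the chunk type and chunk data, but not the length field.
    if zlib.crc32(chunk_type + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    names = ("Width", "Height", "BitDepth", "ColourType",
             "Compressionmethod", "Filtermethod", "Interlacemethod")
    return dict(zip(names, struct.unpack(">IIBBBBB", body)))
```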
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format ndash 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro FLC
[Crazy Talk]
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 30 ndash X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 Interactive Fiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a, 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SNM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
      ckId = fourCC {$ckId.value.equals("RIFF")}?
      ckSize = dWordValue
      formType = fourCC {$formType.value.equals("WAVE")}?
      formatSubChunk
      dataSubChunk
      {$result = true;}
    ;
fourCC returns [String value]
    : BYTE BYTE BYTE BYTE {$value = $text;}
    ;
formatSubChunk
    : subChunkId = fourCC {$subChunkId.value.equals("fmt ")}?
      ckSize = dWordValue
      commonFields
      pcmSpecificField
    ;
commonFields
    : formatTag = wordValue         // Format category
      channels = wordValue          // Number of channels
      samplesPerSec = dWordValue    // Sampling rate
      avgBytesPerSec = dWordValue   // For buffer estimation
      blockAlign = wordValue        // Data block size
    ;
pcmSpecificField
    : bitsPerSample = wordValue     // Sample size
    ;
dataSubChunk
    : subChunkId = fourCC {$subChunkId.value.equals("data")}?
      ckSize = dWordValue
      waveData[$ckSize.value]
    ;
waveData [long size]
    : ( {size >= 0}?=> BYTE {size--;} )+
    ;
// defines the long data type in terms of BYTE terminals
dWordValue returns [long value] throws UnsupportedEncodingException
    : b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE
      {
        if (byteOrder.equals("II"))
          $value = (long)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 |
              getUByteValue($b3) << 16 |
              getUByteValue($b4) << 24 );
        else
          $value = (long)
            ( getUByteValue($b1) << 24 |
              getUByteValue($b2) << 16 |
              getUByteValue($b3) << 8 |
              getUByteValue($b4) );
      }
    ;
    catch [RecognitionException re] {
      reportError(re);
      recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
      System.out.println("Unsupported Encoding Exception");
    }
// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II"))
          $value = (int)
            ( getUByteValue($b1) |
              getUByteValue($b2) << 8 );
        else
          $value = (int)
            ( getUByteValue($b1) << 8 |
              getUByteValue($b2) );
      }
    ;
    catch [RecognitionException re] {
      reportError(re);
      recover(input, re);
    }
    catch [UnsupportedEncodingException uee] {
      System.out.println("Unsupported Encoding Exception");
    }
Terminal rule used to create the Lexer
BYTE
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
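The structure of the grammar in Figure 9 can also be followed in a hand-written recursive descent recognizer, with one method per nonterminal and the semantic predicates turned into explicit checks. The Python sketch below is such a recognizer for the PCM case only. It is not the parser generated by ANTLR; it assumes little-endian ("II") byte order, a 16-byte fmt chunk, and hypothetical names throughout.

```python
import struct


class ParseError(Exception):
    """Raised when the input does not conform to the expected format."""


class WaveRecognizer:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def take(self, n: int) -> bytes:
        if self.pos + n > len(self.data):
            raise ParseError("unexpected end of file")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

    def four_cc(self, expected: bytes) -> None:
        if self.take(4) != expected:
            raise ParseError(f"expected {expected!r} at offset {self.pos - 4}")

    def word_value(self) -> int:
        return struct.unpack("<H", self.take(2))[0]

    def dword_value(self) -> int:
        return struct.unpack("<I", self.take(4))[0]

    def riff_wave(self) -> bool:
        self.four_cc(b"RIFF")
        self.dword_value()            # ckSize
        self.four_cc(b"WAVE")
        self.format_sub_chunk()
        self.data_sub_chunk()
        return True

    def format_sub_chunk(self) -> None:
        self.four_cc(b"fmt ")
        if self.dword_value() != 16:  # assumed size of the PCM format chunk
            raise ParseError("unexpected fmt chunk size")
        self.word_value()             # formatTag
        self.word_value()             # channels
        self.dword_value()            # samplesPerSec
        self.dword_value()            # avgBytesPerSec
        self.word_value()             # blockAlign
        self.word_value()             # bitsPerSample

    def data_sub_chunk(self) -> None:
        self.four_cc(b"data")
        self.take(self.dword_value())  # waveData[ckSize]


def is_wave_pcm(data: bytes) -> bool:
    try:
        return WaveRecognizer(data).riff_wave()
    except ParseError:
        return False
```

The ckSize attribute plays the same role here as in the grammar: its value, read from the file, determines how many bytes the waveData rule consumes.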
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in 2011,
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects: validation
of binary file formats. In fact, binary file grammars and parsers have been constructed for the
chunk-based file formats AIFF, JPEG, and WAVE, as well as the directory-based format TIFF.
However, the research reported herein differs from that of the JHOVE project in that what is
sought is a technology for generating validators for binary file formats from grammars
specifying the binary file format.
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
work of the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
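A file signature test of the kind mentioned above compares bytes at fixed offsets with known values. The Python sketch below illustrates the idea for a few of the formats discussed in this report. The signature table is a small illustrative sample, not the magic file described in [Underwood 2009], and the names SIGNATURES and identify are hypothetical.

```python
from typing import Optional

# Each entry: format name and a list of (offset, expected bytes) tests.
SIGNATURES = [
    ("IFF 8SVX", [(0, b"FORM"), (8, b"8SVX")]),
    ("IFF ILBM", [(0, b"FORM"), (8, b"ILBM")]),
    ("AIFF", [(0, b"FORM"), (8, b"AIFF")]),
    ("RIFF WAVE", [(0, b"RIFF"), (8, b"WAVE")]),
    ("WebP", [(0, b"RIFF"), (8, b"WEBP")]),
    ("Standard MIDI File", [(0, b"MThd")]),
    ("PNG", [(0, b"\x89PNG\r\n\x1a\n")]),
    ("JPEG", [(0, b"\xff\xd8")]),
]


def identify(data: bytes) -> Optional[str]:
    """Return the first format whose signature tests all match, else None."""
    for name, tests in SIGNATURES:
        if all(data[off:off + len(magic)] == magic for off, magic in tests):
            return name
    return None
```

Identification of this kind only selects which grammar to apply; validation against the selected grammar is a separate step.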
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats, and we should be able to do so for all of them. We used these
grammars with the extensions to ANTLR to generate top-down recursive descent parsers for the
file formats defined by these grammars.
The significance of success in this research task is that if binary file formats can be
specified with binary file grammars, then only one parser generator is needed to generate the
parsers for verifying that a file conforms to a particular format. Similarly, the same parser
generator can be used for conversion of legacy file formats to current or standard formats.
Finally, the same parser generator can be used for generating viewers/players for most file
formats. This would increase the likelihood of preserving, and making available into the
indefinite future, those digital records encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISO/IEC JTC 1/SC 29/WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISO/IEC 15948:2003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology "Appendix H LXO file format" Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$
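The sample-count rule for the BODY chunk can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name body_sample_count and its snake_case parameter names are chosen for this example and do not appear in the 8SVX specification.

```python
def body_sample_count(one_shot_hi: int, repeat_hi: int, ct_octave: int):
    """Return (samples per octave, total samples) for an 8SVX BODY chunk.

    Octave k (k = 0 is the highest-frequency octave, stored first) holds
    2**k times the samples of the highest octave.
    """
    highest = one_shot_hi + repeat_hi
    per_octave = [highest * 2 ** k for k in range(ct_octave)]
    return per_octave, sum(per_octave)


sizes, total = body_sample_count(one_shot_hi=24, repeat_hi=8, ct_octave=3)
print(sizes, total)
```

A validator could compare this total with the cksize of the BODY chunk when sCompression indicates uncompressed data; that comparison is a suggestion of this sketch, not a rule stated in the grammar.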
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
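The header chunk production can be checked directly against the first 14 bytes of a file. The following Python sketch assumes big-endian integers, as used in Standard MIDI Files, and follows the text above in requiring a header length of 6; the function name parse_midi_header and the dictionary keys are illustrative choices.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Validate the MThd chunk and return its three 16-bit fields."""
    if len(data) < 14:
        raise ValueError("file too short for a MIDI header chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError(f"unexpected header length {length}")
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}


example = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
print(parse_midi_header(example))
```

The track chunks that follow could be validated in the same manner, by reading each "MTrk" tag and its length and then checking the events inside it.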
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFF-C (AIFC), of the Audio
Interchange File Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these
chunks is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
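The chunk structure shared by the IFF and AIFF grammars can be walked with a short loop. The Python sketch below assumes the EA IFF 85 conventions of big-endian 32-bit chunk sizes and a pad byte after any chunk whose size is odd; the function name iff_chunks is hypothetical, and the sketch checks only the chunk framing, not the fields inside each chunk.

```python
import struct


def iff_chunks(data: bytes):
    """Return (form_type, [(chunk_id, chunk_data), ...]) for one FORM."""
    if len(data) < 12 or data[:4] != b"FORM":
        raise ValueError("not an IFF FORM")
    (form_size,) = struct.unpack(">I", data[4:8])
    end = 8 + form_size
    if end > len(data):
        raise ValueError("FORM size exceeds file length")
    form_type = data[8:12]
    chunks = []
    pos = 12
    while pos + 8 <= end:
        ck_id, ck_size = struct.unpack(">4sI", data[pos:pos + 8])
        body_start = pos + 8
        if body_start + ck_size > end:
            raise ValueError(f"chunk {ck_id!r} overruns the FORM")
        chunks.append((ck_id, data[body_start:body_start + ck_size]))
        # Chunks are aligned to even offsets: skip one pad byte if size is odd.
        pos = body_start + ck_size + (ck_size & 1)
    return form_type, chunks
```

A recursive descent validator for a specific format would then check, for example, that a "COMM" chunk precedes the "SSND" chunk and that each chunk body has the fields its production requires.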
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP
typically achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of
image quality. A WebP file consists of VP8 image data and a container based on RIFF. The format
of a WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
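The offset table translates directly into a fixed-layout header check. The following Python sketch assumes the little-endian size fields used by RIFF containers and covers only the simple lossy layout listed in the table; the function name check_webp_header is an illustrative choice, and the VP8 image data itself is not examined.

```python
import struct


def check_webp_header(data: bytes) -> int:
    """Validate the 20-byte WebP header and return the VP8 data size."""
    if len(data) < 20:
        raise ValueError("file too short for a WebP header")
    riff, riff_size, form, fourcc, vp8_size = struct.unpack(
        "<4sI4s4sI", data[:20])
    if riff != b"RIFF" or form != b"WEBP":
        raise ValueError("not a RIFF/WEBP file")
    if fourcc != b"VP8 ":
        raise ValueError("first chunk is not a VP8 chunk")
    # The VP8 data starts at offset 20 and must fit inside the RIFF chunk,
    # whose data starts at offset 8.
    if 20 + vp8_size > 8 + riff_size:
        raise ValueError("VP8 chunk overruns the RIFF chunk")
    return vp8_size
```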
A5 JPEG
A51 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff    identifies segment
  n      type of segment (one byte)
  sh sl  size of the segment, including these two bytes but not
         including the $ff and the type byte. Note: not Intel order;
         high byte first, low byte last
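The segment format described above can be exercised with a short Python sketch that lists the segments between the SOI marker and either the start-of-scan (SOS) segment or the EOI marker. It is a simplification made for illustration: the function name jpeg_segments is hypothetical, fill bytes and markers without a length field (such as restart markers) are not handled, and the entropy-coded data after SOS is not parsed.

```python
import struct

SOI = b"\xff\xd8"
EOI_TYPE = 0xD9
SOS_TYPE = 0xDA


def jpeg_segments(data: bytes):
    """Return [(segment_type, payload), ...] for the leading segments."""
    if data[:2] != SOI:
        raise ValueError("missing SOI marker")
    segments = []
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type == EOI_TYPE:
            return segments
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # Size is big-endian and counts its own two bytes, not the marker.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError("segment size out of range")
        segments.append((seg_type, data[pos + 4:pos + 2 + size]))
        if seg_type == SOS_TYPE:
            return segments  # compressed image data follows; stop here
        pos += 2 + size
    raise ValueError("no EOI or SOS marker found")
```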
A52 JPEG/Exif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entry -> red green blue
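A validator for the start of a PNG file can be sketched in Python as follows. The sketch assumes the chunk layout of the PNG specification, in which a 4-byte big-endian length precedes the chunk type and a CRC-32 of the type and data follows the data; the function name parse_ihdr and the returned dictionary keys are illustrative choices.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def parse_ihdr(data: bytes) -> dict:
    """Check the PNG signature and IHDR chunk; return the IHDR fields."""
    if len(data) < 33 or data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature or truncated file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a 13-byte IHDR chunk")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    if zlib.crc32(chunk_type + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    names = ("width", "height", "bit_depth", "colour_type",
             "compression_method", "filter_method", "interlace_method")
    return dict(zip(names, struct.unpack(">IIBBBBB", body)))
```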
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format - 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro FLC
[Crazy Talk]
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 Interactive Fiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SNM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
getUByteValue($b3) << 8 |
getUByteValue($b4) );
catch[RecognitionException re] {
    reportError(re);
    recover(input,re);
}
catch[UnsupportedEncodingException uee] {
    System.out.println("Unsupported Encoding Exception");
}
// defines the signed word data type
wordValue returns [int value]
    : b1 = BYTE b2 = BYTE
      {
        if (byteOrder.equals("II"))
            $value = (int)
                ( getUByteValue($b1) |
                  getUByteValue($b2) << 8 );
        else
            $value = (int)
                ( getUByteValue($b1) << 8 |
                  getUByteValue($b2) );
      }
    ;
    catch[RecognitionException re] {
        reportError(re);
        recover(input,re);
    }
    catch[UnsupportedEncodingException uee] {
        System.out.println("Unsupported Encoding Exception");
    }
// Terminal rule used to create the Lexer
BYTE
Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format
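The byte-order logic of the wordValue rule can be illustrated outside ANTLR with a short Python sketch. It assumes, as in the grammar, that the string "II" selects little-endian order and that any other value selects big-endian order. The function name word_value and the signed parameter are additions made for this example; the grammar action itself combines the two bytes without sign extension.

```python
def word_value(b1: int, b2: int, byte_order: str = "II",
               signed: bool = False) -> int:
    """Combine two bytes, read in file order, into a 16-bit word."""
    order = "little" if byte_order == "II" else "big"
    return int.from_bytes(bytes([b1, b2]), order, signed=signed)


print(hex(word_value(0x34, 0x12, "II")))
print(word_value(0xFF, 0xFE, "MM", signed=True))
```

Passing signed=True interprets the 16 bits as a two's complement value, which is what a signed word type requires when the high bit is set.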
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced Ada Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [CCSDS 2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate, and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe
non-XML physical representations. Version 1 of the language specification was published in
2011, and a parser is being implemented.
Each of the data description languages described above can be used to define data types and
file structures of binary files. However, they are based on file structure declarations such as
those of C or Java. The binary file format grammar described in this paper most closely
resembles DFDL. However, the binary file grammar described in this paper is the only data
description language based on formal grammars that is used for creating recognizers for file
formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files.
JHOVE is a format-specific digital object validation API written in Java. JHOVE supports
validation of the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF,
UTF-8, WAVE, and XML. JHOVE2 is a second-generation validation environment [California
Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects:
validation of binary file formats. In fact, binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF, JPEG, and WAVE as well as the
directory-based format TIFF. However, the research reported herein differs from that of the
JHOVE project in that what is sought is a technology for generating validators for binary file
formats from grammars specifying the binary file format.
8 Conclusion
The research question addressed by his research is whether it is possible to extend the context-
free grammars used to specify the syntax of programming languages to the specification of
binary file formats and to use these grammars with parsers for validating the file formats of
binary files Amajor family of binary file formats chunk-based was described Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats Examples of attribute array grammars for two chunk-based
binary file format were then presented Recursive descent parsers for this classes of grammars
was then described Experience in using ANTLR a parser generator for LL() string grammars
in generating parsers for recognizing the formats of binary files was discussed Finally related
research in data description languages for binary file formats was described and the JHOVE
project in creating Java-based tools for validating file formats was described
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
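To make concrete what such a generated recognizer might look like, the following is a minimal
hand-written sketch in Python of a recursive descent validator for a generic EA IFF 85 FORM. It is
not output of ANTLR or of the tools described in this report. The names FormatError, parse_chunk,
parse_form, and make_chunk are hypothetical, and the even-byte padding rule is taken from the
EA IFF 85 specification rather than from the grammars in this report.

```python
import struct


class FormatError(Exception):
    """Raised when the byte array does not conform to the grammar."""


def parse_chunk(data, pos):
    """chunk -> ckID cksize ckdata[cksize]; returns (ckID, ckdata, next position)."""
    if pos + 8 > len(data):
        raise FormatError(f"truncated chunk header at offset {pos}")
    ck_id = data[pos:pos + 4]
    (ck_size,) = struct.unpack(">I", data[pos + 4:pos + 8])  # big-endian ULONG
    end = pos + 8 + ck_size
    if end > len(data):
        raise FormatError(f"chunk {ck_id!r} overruns its container")
    # Assumption from EA IFF 85: odd-sized chunk data is padded to an even length.
    return ck_id, data[pos + 8:end], end + (ck_size & 1)


def parse_form(data, form_type):
    """form -> "FORM" cksize formType chunk*; returns the list of local chunk IDs."""
    ck_id, body, _ = parse_chunk(data, 0)
    if ck_id != b"FORM":
        raise FormatError("missing FORM header")
    if body[:4] != form_type:
        raise FormatError(f"expected form type {form_type!r}")
    ids, pos = [], 4
    while pos < len(body):
        local_id, _, pos = parse_chunk(body, pos)
        ids.append(local_id)
    return ids


def make_chunk(ck_id, payload):
    """Hypothetical helper that builds a chunk for testing the recognizer."""
    pad = b"\x00" if len(payload) % 2 else b""
    return ck_id + struct.pack(">I", len(payload)) + payload + pad


sample = make_chunk(
    b"FORM",
    b"8SVX" + make_chunk(b"VHDR", bytes(20)) + make_chunk(b"BODY", b"\x01\x02\x03"),
)
print(parse_form(sample, b"8SVX"))
```

Each function corresponds to one production, and the chunk size read from the file acts as an
attribute that bounds how far the parser may read. A generated parser would presumably add one
function per chunk type to check the fields inside each chunk.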
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
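The idea of a signature test can be sketched in a few lines of Python. The table below is not the
magic file referred to above; it is a hypothetical MAGIC_TESTS list built from the well-known
leading bytes of a few of the formats named in this report, and identify is an illustrative name.

```python
# Each entry: (format name, list of (offset, expected bytes) tests that must all match).
MAGIC_TESTS = [
    ("AIFF", [(0, b"FORM"), (8, b"AIFF")]),
    ("8SVX", [(0, b"FORM"), (8, b"8SVX")]),
    ("WAVE", [(0, b"RIFF"), (8, b"WAVE")]),
    ("MIDI", [(0, b"MThd")]),
    ("PNG", [(0, b"\x89PNG\r\n\x1a\n")]),
    ("JPEG", [(0, b"\xff\xd8")]),
]


def identify(data):
    """Return the name of the first format whose signature tests all match, else None."""
    for name, tests in MAGIC_TESTS:
        if all(data[off:off + len(sig)] == sig for off, sig in tests):
            return name
    return None


print(identify(b"FORM\x00\x00\x00\x04AIFF"))
```

A signature test of this kind only selects which grammar to apply. It says nothing about whether
the rest of the file conforms to the format, which is the job of the validator.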
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents.
The closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with
these formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving and making available into the indefinite future those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYX's note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
[Faase] Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISO/IEC JTC 1/SC 29/WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISO/IEC 15948:2003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOffice.org's Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOffice.org April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
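The sample count above is the kind of attribute a validator could compare against the cksize of
the BODY chunk. A small Python sketch of the arithmetic follows; the function names octave_sizes
and body_sample_count are hypothetical, and the comparison against cksize assumes uncompressed
data with one byte per sample.

```python
def octave_sizes(ct_octave, one_shot_hi, repeat_hi):
    """Number of samples in each octave, highest frequency octave first."""
    base = one_shot_hi + repeat_hi
    # Each successive (lower) octave holds twice as many samples as the one before it.
    return [base << k for k in range(ct_octave)]


def body_sample_count(ct_octave, one_shot_hi, repeat_hi):
    """Total samples in BODY: (2^0 + ... + 2^(ctOctave-1)) * (oneShot + repeat)."""
    return sum(octave_sizes(ct_octave, one_shot_hi, repeat_hi))


print(octave_sizes(3, 10, 6))
print(body_sample_count(3, 10, 6))
```

Because the geometric series sums to 2^ctOctave - 1, the total can also be written in closed form
as (2**ct_octave - 1) * (one_shot_hi + repeat_hi).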
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
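The chunk-level productions of this grammar can be checked with a short recognizer. The Python
sketch below is illustrative only: validate_midi is a hypothetical name, the sketch assumes the
big-endian field order of the Standard MIDI File specification, and it does not descend into the
event productions inside each track.

```python
import struct


def validate_midi(data):
    """Midi -> <headerchunk> <trackchunks>; returns (format, ntrks, division)."""
    if len(data) < 14 or data[:4] != b"MThd":
        raise ValueError("missing MThd header chunk")
    # <length of header data> followed by the three 16-bit header integers.
    length, fmt, ntrks, division = struct.unpack(">IHHH", data[4:14])
    if length != 6:
        raise ValueError("header data length must be 6")
    pos = 14
    for _ in range(ntrks):
        if len(data) < pos + 8 or data[pos:pos + 4] != b"MTrk":
            raise ValueError(f"expected MTrk chunk at offset {pos}")
        (track_len,) = struct.unpack(">I", data[pos + 4:pos + 8])
        pos += 8 + track_len
        if pos > len(data):
            raise ValueError("track data overruns the file")
    if pos != len(data):
        raise ValueError("unexpected bytes after the last track chunk")
    return fmt, ntrks, division


# A track holding only the end-of-track meta-event: delta-time 0, 0xFF 0x2F, length 0.
track = b"MTrk" + struct.pack(">I", 4) + b"\x00\xff\x2f\x00"
sample = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 96) + track * 2
print(validate_midi(sample))
```

Here the ntrks attribute read from the header determines how many track chunks the recognizer
expects, which is the sort of context that a plain context-free grammar cannot express.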
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. Apple [1991] also specified a compressed version, AIFC, of the Audio
Interchange File Format. The following recasts the AIFF specification as a context-free
pseudo-EBNF grammar.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote
                   lowVelocity highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[]
CommentsChunk -> "COMT" cksize numcomments comments[]
Comments -> Comment Comments
Comment -> timeStamp marker count text
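The CommonChunk production illustrates why binary file grammars need data types beyond characters.
The Python sketch below decodes the data portion of a COMM chunk. The field widths are assumptions
taken from Apple's AIFF specification rather than from the grammar above: numChannels and
sampleSize are 16-bit integers, numSampleFrames is a 32-bit unsigned integer, and sampleRate is an
80-bit IEEE 754 extended-precision number. The names decode_extended and parse_comm are
hypothetical.

```python
import math
import struct


def decode_extended(raw):
    """Decode a 10-byte big-endian 80-bit extended float (infinities and NaNs not handled)."""
    if len(raw) != 10:
        raise ValueError("extended value must be 10 bytes")
    sign_exp, mantissa = struct.unpack(">HQ", raw)
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exponent = sign_exp & 0x7FFF
    # The 64-bit mantissa has an explicit integer bit, so
    # value = mantissa * 2**(exponent - 16383 - 63).
    return sign * math.ldexp(float(mantissa), exponent - 16446)


def parse_comm(ckdata):
    """Return (numChannels, numSampleFrames, sampleSize, sampleRate) from COMM chunk data."""
    if len(ckdata) != 18:
        raise ValueError("COMM chunk data must be 18 bytes")
    num_channels, num_frames, sample_size = struct.unpack(">hIh", ckdata[:8])
    return num_channels, num_frames, sample_size, decode_extended(ckdata[8:18])


# 0x400E AC44 0000 0000 0000 is the extended encoding of 44100.
comm = struct.pack(">hIh", 2, 1000, 16) + bytes.fromhex("400EAC44000000000000")
print(parse_comm(comm))
```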
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these
chunks is optional.
nameck -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
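The fixed layout above translates directly into a header check. The following Python sketch is
an illustration under stated assumptions: parse_webp_header is a hypothetical name, the two size
fields are read as little-endian 32-bit integers as in other RIFF files, and only the simple
single-chunk layout shown in the table is handled.

```python
import struct


def parse_webp_header(data):
    """Check the 20-byte header in the table above and return the VP8 image data."""
    if len(data) < 20:
        raise ValueError("file too short for a WebP header")
    riff, riff_size, form_type, vp8_tag, vp8_size = struct.unpack("<4sI4s4sI", data[:20])
    if riff != b"RIFF" or form_type != b"WEBP":
        raise ValueError("not a RIFF/WEBP container")
    if vp8_tag != b"VP8 ":
        raise ValueError("missing VP8 chunk tag")
    if riff_size > len(data) - 8:
        raise ValueError("RIFF size exceeds the file length")
    # The form type and the VP8 chunk header take 12 bytes of the RIFF data.
    if vp8_size > riff_size - 12:
        raise ValueError("VP8 chunk size exceeds the RIFF data size")
    return data[20:20 + vp8_size]


payload = b"\x01\x02\x03\x04"
sample = (b"RIFF" + struct.pack("<I", 4 + 8 + len(payload)) + b"WEBP"
          + b"VP8 " + struct.pack("<I", len(payload)) + payload)
print(parse_webp_header(sample))
```

Unlike the IFF-derived formats of section A.1, RIFF stores sizes in Intel (little-endian) byte
order, so a binary file grammar has to record the byte order of each integer type.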
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker; see below
- any number of "segments" (similar to IFF chunks); see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
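The segment structure described above can be walked with a short loop. The Python sketch below
uses the hypothetical name list_segments. It stops after the start-of-scan segment (type $da)
because, as an assumption taken from the JPEG standard rather than from the listing above,
entropy-coded image data with no length field follows that segment.

```python
import struct


def list_segments(data):
    """Return [(segment type, size)] for the segments that follow the SOI marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI header")
    if data[-2:] != b"\xff\xd9":
        raise ValueError("missing EOI trailer")
    segments, pos = [], 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        # sh, sl: high byte first; the size counts these two bytes but not $ff and the type.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError(f"bad segment size at offset {pos}")
        segments.append((seg_type, size))
        pos += 2 + size
        if seg_type == 0xDA:  # start of scan: compressed data follows
            break
    return segments


app0 = b"\xff\xe0" + struct.pack(">H", 16) + b"JFIF\x00" + bytes(9)
sos = b"\xff\xda" + struct.pack(">H", 2)
sample = b"\xff\xd8" + app0 + sos + b"\x12\x34" + b"\xff\xd9"
print(list_segments(sample))
```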
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
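A recognizer for the header chunk might look like the Python sketch below. It follows the byte
layout given in the PNG specification, in which each chunk is stored as a 4-byte length, the
4-byte chunk type, the data, and a CRC-32 computed over the type and data. That ordering and the
CRC field are assumptions taken from the specification, not from the grammar above, and the names
read_chunk and parse_ihdr are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_chunk(data, pos):
    """Read one chunk at pos; returns (chunk type, chunk data, next position)."""
    if pos + 12 > len(data):
        raise ValueError("truncated chunk")
    (length,) = struct.unpack(">I", data[pos:pos + 4])
    end = pos + 8 + length
    if end + 4 > len(data):
        raise ValueError("chunk overruns the file")
    ck_type, body = data[pos + 4:pos + 8], data[pos + 8:end]
    (crc,) = struct.unpack(">I", data[end:end + 4])
    if zlib.crc32(ck_type + body) != crc:
        raise ValueError(f"CRC mismatch in chunk {ck_type!r}")
    return ck_type, body, end + 4


def parse_ihdr(data):
    """Return (Width, Height, BitDepth, ColourType, Compression, Filter, Interlace)."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    ck_type, body, _ = read_chunk(data, 8)
    if ck_type != b"IHDR" or len(body) != 13:
        raise ValueError("first chunk must be a 13-byte IHDR")
    return struct.unpack(">IIBBBBB", body)


ihdr = struct.pack(">IIBBBBB", 640, 480, 8, 3, 0, 0, 0)
sample = (PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR" + ihdr
          + struct.pack(">I", zlib.crc32(b"IHDR" + ihdr)))
print(parse_ihdr(sample))
```

The CRC is an example of a semantic condition that an attribute grammar can attach to a
production: the chunk is syntactically well formed but is accepted only if the computed
attribute matches the stored one.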
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
21
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML
Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-
XML physical representations Version 1 of the language specification was published in 2011
and a parser is being implemented
Each of the data description languages described above can be used to define data types and file
structures of binary files However they are based on file structure declarations such as those of
C or Java The binary file format grammar described in this paper most closely resembles DFDL
However the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating recognizers for file formats
JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to
provide automated and efficient identification and validation of the formats of digital files
JHOVE is a format-specific digital object validation API written in Java JHOVE supports
validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF
UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California
digital Library 2011]
The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation
of binary file formats As a matter of fact binary file grammars and parsers have been
constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-
based format TIFF However the research reported herein differs from that of the JHOVE
project in that what is sought is a technology for generating validators for binary file formats
from grammars specifying the binary file format
8 Conclusion
The research question addressed by this research is whether it is possible to extend the
context-free grammars used to specify the syntax of programming languages to the specification of
binary file formats, and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, the chunk-based formats, was described. Then
extensions to the concepts of context-free grammars and attribute grammars were described that
enable the specification of binary file formats. Examples of attribute array grammars for two
chunk-based binary file formats were then presented. Recursive descent parsers for this class of
grammars were then described. Experience in using ANTLR, a parser generator for LL(*) string
grammars, in generating parsers for recognizing the formats of binary files was discussed. Finally,
related research in data description languages for binary file formats was described, as was the
work of the JHOVE project in creating Java-based tools for validating file formats.
It is concluded that it is possible to extend context-free grammars to the specification of
chunk-based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars, based on binary file (array) grammars, are adequate to define
other families of binary file formats.
To be a practical technology for generating recognizers (validators) for binary file formats, a
parser generator is needed that creates, from a binary file grammar, a recursive descent parser
(with a pushdown stack if needed). ANTLR, which accepts LL(*) grammars as input and
generates a lexer as well as a parser, is too heavyweight a parser generator for this task and does
not have the requisite data types built in.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report, including those referenced in Appendix A. We also have file
signature tests (magic tests) for all these file types.
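A file signature test of the kind mentioned above can be sketched in a few lines. The following Python fragment is only an illustration, not the magic file of [Underwood 2009]: the table SIGNATURES and the function identify are hypothetical names, and the table covers only a handful of the formats named in this report, using their well-known leading bytes.

```python
# Hypothetical signature table: each format maps to (offset, expected bytes) tests.
SIGNATURES = {
    "IFF/8SVX": [(0, b"FORM"), (8, b"8SVX")],
    "IFF/ILBM": [(0, b"FORM"), (8, b"ILBM")],
    "AIFF": [(0, b"FORM"), (8, b"AIFF")],
    "RIFF/WAVE": [(0, b"RIFF"), (8, b"WAVE")],
    "MIDI": [(0, b"MThd")],
    "PNG": [(0, b"\x89PNG\r\n\x1a\n")],
    "JPEG": [(0, b"\xff\xd8")],
}


def identify(data: bytes):
    """Return the first format whose signature tests all match, else None."""
    for name, tests in SIGNATURES.items():
        if all(data[off:off + len(magic)] == magic for off, magic in tests):
            return name
    return None
```

Identification of this kind only selects which grammar to apply; it says nothing about whether the rest of the file conforms to the format.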
We would like to have a parser generator (also called a compiler-compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format, or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR
(www.antlr.org). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings and generates a separate scanner.
Appendix A lists the names of about 75 file formats that belong to the family of "chunk-based"
file formats. We have specifications for each of these file formats and samples of files with these
formats.
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that, if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewers/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \dots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
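The sample-count formula above lends itself to a simple semantic check of the kind an attribute grammar could attach to the BODY chunk. The Python sketch below is illustrative only: the function names are hypothetical, and the consistency check assumes uncompressed data (sCompression equal to 0), so that each sample occupies one byte of the BODY chunk.

```python
def octave_sizes(ct_octave: int, one_shot_hi: int, repeat_hi: int):
    """Samples per octave, highest-frequency octave first."""
    hi = one_shot_hi + repeat_hi
    # Each successive (lower) octave holds twice as many samples.
    return [(2 ** k) * hi for k in range(ct_octave)]


def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """(2^0 + ... + 2^(ctOctave-1)) * (oneShotHiSamples + repeatHiSamples)."""
    return sum(octave_sizes(ct_octave, one_shot_hi, repeat_hi))


def body_size_is_consistent(body_cksize: int, ct_octave: int,
                            one_shot_hi: int, repeat_hi: int) -> bool:
    # Assumption: sCompression == 0, i.e. one byte per 8-bit sample.
    return body_cksize == body_sample_count(ct_octave, one_shot_hi, repeat_hi)
```

Because the sum of the powers of two is 2^ctOctave - 1, the total is also (2^ctOctave - 1) times the size of the highest octave, which gives a quick cross-check.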
A13 Cel Animation (ANIM)
[Sparta 1988]
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
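As an illustration of how the header chunk rule might be recognized, the following Python sketch reads the "MThd" tag, the length, and the three 16-bit header fields. It is a minimal sketch, not a full validator: parse_midi_header is a hypothetical name, the track chunks are not examined, and treating any length other than 6 as an error follows the statement above rather than a more lenient reading of the format. Standard MIDI files store multi-byte integers most significant byte first.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Recognize <headerchunk> and return its three header data fields."""
    if len(data) < 14:
        raise ValueError("too short to contain a header chunk")
    # 4-byte chunk tag followed by a 32-bit big-endian length.
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data is not 6")
    # <header data> -> <format> <ntrks> <division>, each a 16-bit integer.
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}
```

A validator built on this would go on to check that ntrks track chunks, each beginning with "MTrk", follow the header.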
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFF-C (AIFC), of the Audio
Interchange File Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
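The chunk structure shared by these rules can be walked generically before any chunk-specific rule is applied. The Python sketch below is an assumption-laden illustration, not the parser described in this report: iter_chunks and check_aiff are hypothetical names, chunk sizes are read as 32-bit big-endian integers, and the IFF convention that an odd-sized chunk is followed by one uncounted pad byte is applied. Only the presence of the Common chunk is checked.

```python
import struct


def iter_chunks(data: bytes, start: int, end: int):
    """Yield (ckID, ckData) for each chunk found in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        ck_id, ck_size = struct.unpack(">4sI", data[pos:pos + 8])
        body_start = pos + 8
        body_end = body_start + ck_size
        if body_end > end:
            raise ValueError(f"chunk {ck_id!r} overruns its container")
        yield ck_id, data[body_start:body_end]
        # Odd-sized chunks are followed by a pad byte not counted in cksize.
        pos = body_end + (ck_size & 1)


def check_aiff(data: bytes):
    """Return the local chunk IDs of a FORM AIFF file, in file order."""
    if len(data) < 12:
        raise ValueError("too short to be a FORM chunk")
    form, ck_size, form_type = struct.unpack(">4sI4s", data[:12])
    if form != b"FORM" or form_type != b"AIFF":
        raise ValueError("not a FORM AIFF file")
    if 8 + ck_size > len(data):
        raise ValueError("FORM cksize exceeds file length")
    ids = [ck_id for ck_id, _ in iter_chunks(data, 12, 8 + ck_size)]
    if b"COMM" not in ids:
        raise ValueError("Common chunk is missing")
    return ids
```

A grammar-driven validator would dispatch on each returned chunk ID to the corresponding rule (CommonChunk, SoundDataChunk, and so on) instead of merely listing the IDs.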
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A41 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A49 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
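The offsets in the layout above translate directly into a header check. The following Python sketch is illustrative and covers only the simple layout shown in the table: check_webp_header is a hypothetical name, the "VP8 " tag is assumed to carry a trailing space to fill four bytes, and the sizes are read as 32-bit little-endian integers, as in other RIFF files.

```python
import struct


def check_webp_header(data: bytes) -> int:
    """Validate the RIFF/WEBP/VP8 header and return the VP8 data size."""
    if len(data) < 20:
        raise ValueError("too short for a WebP header")
    # Offsets 0, 4, 8: "RIFF" tag, little-endian size, "WEBP" form type.
    riff, riff_size, form = struct.unpack("<4sI4s", data[:12])
    if riff != b"RIFF" or form != b"WEBP":
        raise ValueError("not a RIFF WEBP file")
    if riff_size > len(data) - 8:
        raise ValueError("RIFF size exceeds file length")
    # Offsets 12, 16: "VP8 " tag and size of the image data at offset 20.
    tag, vp8_size = struct.unpack("<4sI", data[12:20])
    if tag != b"VP8 ":
        raise ValueError("missing VP8 chunk tag")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 chunk overruns the file")
    return vp8_size
```

The VP8 bitstream itself is not examined; the sketch validates the container only.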
A5 JPEG
A51 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
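The segment format above can be walked with a short loop. The Python sketch below is a simplified illustration under stated assumptions: list_segments is a hypothetical name, the walk stops at the start-of-scan segment (type $da) because compressed image data follows it, and every marker before that point is assumed to carry a two-byte size field.

```python
import struct

SOI = b"\xff\xd8"  # header: start of image
EOI = b"\xff\xd9"  # trailer: end of image
SOS = 0xDA         # start of scan; compressed data follows this segment


def list_segments(data: bytes):
    """Return (type, size) pairs for the segments from SOI through SOS."""
    if data[:2] != SOI:
        raise ValueError("missing SOI header")
    if data[-2:] != EOI:
        raise ValueError("missing EOI trailer")
    pos, segments = 2, []
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        # Size is high byte first and counts its own two bytes.
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2:
            raise ValueError("segment size smaller than its size field")
        segments.append((seg_type, size))
        pos += 2 + size  # skip $ff, type byte excluded from size
        if seg_type == SOS:
            break
    return segments
```

For a JFIF file, the first pair returned should have type $e0 (APP0), which a stricter validator would check.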
A52 JPEG/Exif
[JEIDA 2002]
A53 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A61 Windows Media Audio 7-9
[Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
A91 3D Studio File Format ndash 3ds
[Autodesk 1997]
A92 3D Studio Material Library Files
[Fercoq 1996]
A10 Anim8or
[Glanville 2003]
A11 Animator
[Crazy Talk]
A111 Animator Cel
[Crazy Talk]
A112 Animator Pro Fli
[Haaland]
A113 Animator Pro FLC
[Crazy Talk]
A114 Animator Pro PIC
[Maischein]
A12 Audio Manager Module
[Foo 1995]
A13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A15 CorelDRAW Vector Graphics File
(Versions 30 ndash X3)
[Novikov and Fillipov]
A16 Corel Metafile Exchange Image
[Corel 1998]
A17 Music Modules
A171 DigiTrekker Module Format
[Beham 1995]
A172 Graoumf Tracker modules
[Soras 1995]
A173 Oktalyzer
[Zappe 1994]
A18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A181 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A182 Electronic Arts CMV
[MultimediaWiki 2008a]
A183 Electronic Arts DCT
[MultimediaWiki 2008b]
A184 Electronic Arts MAD
[MultimediaWiki 2008c]
A185 Electronic Arts MPC
[MultimediaWiki 2008d]
A186 Electronic Arts VP6
[MultimediaWiki 2008e]
A187 Electronic Arts TGV
[MultimediaWiki 2008f]
A188 Electronic Arts TGQ
[MultimediaWiki 2008g]
A189 Electronic Arts TQI
[MultimediaWiki 2009a]
A19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A20 Lightwave 3D Objects
A201 LWOB
[LightWave 1996]
A202 LWO2
[Lightwave 2001a]
A203 LWLO
[Ferguson 1995]
A21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A22 Paint Shop Pro (PSP)
[Jasc 1998]
A23 ProVector 2D Drawing
[Cunniff and Orr]
A24 ProWrite Document
[Bayless 1987]
A25 Interactive Fiction
A251 Blorb Z-machine Resource File
[Plotkin]
A252 Quetzal
[Frost 1997a 1997b]
A26 Apple QuickTime
[Apple 2007]
A27 Maya IFF Bitmap
[Alias-wavefront 1999]
A28 Apple Core Audio Format
[Apple 2005]
A29 American Laser Games (MM)
[MultimediaWiki 2008h]
A30 Smush Animation Format
A301 SAN
[MultimediaWiki 2006]
A302 SNM
[MultimediaWiki 2009b]
A31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A33 Simple Vector Array Format
[Martin 2007]
A34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]
A35 Cinema 4D
[Losch 1997]
A36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis's The Sims [Baum & Noel 2002].
22
It is concluded that it is possible to extend context-free grammars to the specification of chunk-
based binary file formats Furthermore these grammars can be used with recursive descent
parsers for validating the file formats of chunk-based binary files It remains to be determined
whether these attribute grammars based on binary file (array) grammars are adequate to define
other families of binary file formats
To be a practical technology for generating recognizers (validators) for binary file formats a
parser generator is needed that creates from a binary file grammar a recursive descent parser
(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and
generates a lexer as well as a parser is too heavyweight a parser generator for this task and does
not have the requisite data types built in
A capability to validate binary file formats assumes that the file format of a file is already
known In this research we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009] We have samples of the chunk-based file
formats discussed in this report including those referenced in Appendix a We also have file
signature tests (magic tests) for all these file types
We would like to have a parser generator (also called a compilerndashcompiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format or that interprets the file format and displays or plays the contents The
closest parser generator that we have found to having these capabilities is ANTLR
(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar However it works with strings and generates a separate scanner
Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based
file formats We have specifications for each of file formats and samples of files with these
formats
These file formats have a similar structure. We have developed binary file grammars for two of
these binary file formats, and we should be able to do so for all of them. We used these grammars
with the extensions to ANTLR to generate top-down recursive descent parsers for the file
formats defined with these grammars.
The significance of success in this research task is that if binary file formats can be specified
with binary file grammars, then only one parser generator is needed to generate the parsers for
verifying that a file conforms to a particular format. Similarly, the same parser generator can be
used for conversion of legacy file formats to current or standard formats. Finally, the same parser
generator can be used for generating viewer/players for most file formats. This would increase
the likelihood of preserving, and making available into the indefinite future, those digital records
encoded in binary file formats.
References
[Aho 1968] A V Aho Indexed Grammars - An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology "Appendix H LXO file format" Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison "EA IFF 85" Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison "SMUS" IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward & S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A. Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in Section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathit{ctOctave}-1}) \times (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$
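The sample-count rule can be checked mechanically once the VHDR fields are known. The following Python sketch is an illustration only and is not taken from the 8SVX specification. The names `parse_vhdr`, `octave_sizes`, and `body_sample_count` are hypothetical, and the layout assumes big-endian fields with volume read as a 32-bit Fixed value, as in the C structure of [Hayes and Morrison 1985].

```python
import struct

# Assumed VHDR payload layout (big-endian): three ULONGs, one UWORD,
# two UBYTEs, and a 32-bit Fixed volume, 20 bytes in total.
VHDR_FORMAT = ">IIIHBBi"
VHDR_FIELDS = (
    "oneShotHiSamples", "repeatHiSamples", "samplesPerHiCycle",
    "samplesPerSec", "ctOctave", "sCompression", "volume",
)


def parse_vhdr(payload: bytes) -> dict:
    """Decode the data portion of a VHDR chunk into named fields."""
    if len(payload) != struct.calcsize(VHDR_FORMAT):
        raise ValueError("VHDR payload must be 20 bytes")
    return dict(zip(VHDR_FIELDS, struct.unpack(VHDR_FORMAT, payload)))


def octave_sizes(header: dict) -> list:
    """Sample counts per octave, highest-frequency octave first."""
    base = header["oneShotHiSamples"] + header["repeatHiSamples"]
    # Each successive (lower) octave holds twice as many samples.
    return [base << k for k in range(header["ctOctave"])]


def body_sample_count(header: dict) -> int:
    """Total samples in BODY: (2^0 + ... + 2^(ctOctave-1)) * base."""
    return sum(octave_sizes(header))
```

For a header with oneShotHiSamples = 24, repeatHiSamples = 8, and ctOctave = 3, the octaves hold 32, 64, and 128 samples, for a total of 224. A validator could compare this figure with the cksize of the BODY chunk when the sample data is stored uncompressed.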
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
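The chunk-level rules of this grammar translate almost directly into a recognizer. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, integers are read as big-endian (as the Standard MIDI File format stores them), and the track data is skipped rather than parsed into events. It follows the grammar strictly, so any chunk after the header that is not tagged MTrk is rejected.

```python
import struct


def parse_header_chunk(data: bytes) -> dict:
    """Recognize <headerchunk> at the start of a Standard MIDI File."""
    if len(data) < 14:
        raise ValueError("file too short for a header chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data must be 6")
    # Header data: three 16-bit integers.
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}


def track_chunk_lengths(data: bytes, offset: int = 14) -> list:
    """Recognize <trackchunks>; return the length of each track's data."""
    lengths = []
    while offset < len(data):
        if offset + 8 > len(data):
            raise ValueError("truncated track chunk header")
        tag, length = struct.unpack(">4sI", data[offset:offset + 8])
        if tag != b"MTrk":
            raise ValueError("expected MTrk tag")
        if offset + 8 + length > len(data):
            raise ValueError("track data runs past end of file")
        lengths.append(length)
        offset += 8 + length  # skip the track data without parsing events
    return lengths
```

A fuller validator would also compare the number of track chunks found with the ntrks field of the header and parse each track's events.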
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks: their data portion consists solely of text. Each of these chunks is
optional.
nameck -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
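Because every rule in the AIFF grammar has the same tag-size-data shape, a recognizer can be built around a single chunk iterator. The following Python sketch is an illustration under stated assumptions, not the parser used in this research: `iter_chunks` and `validate_aiff` are hypothetical names, sizes are read as big-endian 32-bit integers, and odd-sized chunks are assumed to be followed by one pad byte, as in the EA IFF 85 convention. It accepts the local chunks in any order and checks only that the COMM and SSND chunks named in the localchunks rule are present.

```python
import struct


def iter_chunks(data: bytes, start: int, end: int):
    """Yield (ckID, ckData) for each chunk in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        ck_id, ck_size = struct.unpack(">4sI", data[pos:pos + 8])
        body_start = pos + 8
        body_end = body_start + ck_size
        if body_end > end:
            raise ValueError("chunk %r runs past end of FORM" % ck_id)
        yield ck_id, data[body_start:body_end]
        # Assumed IFF rule: chunks are padded to an even length.
        pos = body_end + (ck_size & 1)


def validate_aiff(data: bytes) -> list:
    """Check the FORM AIFF wrapper and return the local chunk IDs."""
    if len(data) < 12:
        raise ValueError("file too short for a FORM header")
    form, size, form_type = struct.unpack(">4sI4s", data[:12])
    if form != b"FORM" or form_type != b"AIFF":
        raise ValueError("not a FORM AIFF file")
    end = 8 + size  # cksize counts the bytes after the size field
    if end > len(data):
        raise ValueError("FORM size runs past end of file")
    ids = [ck_id for ck_id, _ in iter_chunks(data, 12, end)]
    for required in (b"COMM", b"SSND"):
        if required not in ids:
            raise ValueError("missing required chunk %r" % required)
    return ids
```

The same iterator would serve for the optional NAME, AUTH, "(c) ", and ANNO text chunks, since they share the tag-size-data layout.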
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
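The fixed offsets in this layout make the WebP container easy to check. The Python sketch below is illustrative only: `check_webp_header` is a hypothetical name, the sizes are read as little-endian 32-bit integers (the RIFF convention), and only the single-chunk layout shown in the table is handled.

```python
import struct


def check_webp_header(data: bytes) -> int:
    """Validate the 20-byte WebP/RIFF header; return the VP8 data size."""
    if len(data) < 20:
        raise ValueError("file too short for a WebP header")
    riff, riff_size, form_type, vp8_tag, vp8_size = struct.unpack(
        "<4sI4s4sI", data[:20]
    )
    if riff != b"RIFF":
        raise ValueError("missing RIFF tag at offset 0")
    if form_type != b"WEBP":
        raise ValueError("missing WEBP form type at offset 8")
    if vp8_tag != b"VP8 ":
        raise ValueError("missing VP8 tag at offset 12")
    # riff_size counts bytes from offset 8; vp8_size counts from offset 20.
    if 8 + riff_size > len(data):
        raise ValueError("RIFF size runs past end of file")
    if 20 + vp8_size > 8 + riff_size:
        raise ValueError("VP8 chunk does not fit inside the RIFF chunk")
    return vp8_size
```

The two size checks express the nesting constraint that a context-free grammar alone cannot state: the inner chunk must lie entirely within the outer one.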
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
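The segment structure described above can be walked with a short loop. The following Python sketch is an illustration under stated assumptions: `list_jpeg_segments` is a hypothetical name, the walk stops at the start-of-scan segment (type $da) because compressed image data rather than further length-prefixed segments follows it, and markers that carry no length field are not handled.

```python
def list_jpeg_segments(data: bytes) -> list:
    """Return (type, size) for each segment from SOI up to start-of-scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI header ($ff $d8)")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("segment does not start with $ff")
        seg_type = data[pos + 1]
        # Size is stored high byte first and includes its own two bytes.
        size = int.from_bytes(data[pos + 2:pos + 4], "big")
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError("bad segment size")
        segments.append((seg_type, size))
        if seg_type == 0xDA:
            break  # start of scan: compressed data follows
        pos += 2 + size
    if data[-2:] != b"\xff\xd9":
        raise ValueError("missing EOI trailer ($ff $d9)")
    return segments
```

For a JFIF file, a validator would additionally require that the first segment returned has type $e0 (APP0).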
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [ASF 2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNGfile -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entry -> red green blue
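A recognizer for the first two symbols of this grammar can be sketched in a few lines. The Python example below is illustrative only: `parse_ihdr` is a hypothetical name, and the chunk layout it assumes is the one given in the PNG specification, where each chunk is stored as a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC computed over the type and data.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_FIELDS = (
    "Width", "Height", "BitDepth", "ColourType",
    "Compressionmethod", "Filtermethod", "Interlacemethod",
)


def parse_ihdr(data: bytes) -> dict:
    """Check the PNG signature and decode the IHDR chunk that follows."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    if len(data) < 33:
        raise ValueError("file too short for an IHDR chunk")
    length, ck_type = struct.unpack(">I4s", data[8:16])
    if ck_type != b"IHDR" or length != 13:
        raise ValueError("first chunk must be a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    # The CRC covers the chunk type and data, not the length field.
    if zlib.crc32(ck_type + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    return dict(zip(IHDR_FIELDS, struct.unpack(">IIBBBBB", body)))
```

The CRC check is an example of a constraint that would be expressed as an attribute condition on the chunk rule rather than in the context-free part of the grammar.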
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
23
the likelihood of preserving and making available into the indefinite future hose digital records
encoded in binary file formats
24
References
[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
25
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
26
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology -- Computer graphics and image processing -- Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology “Appendix H LXO file format” Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison “EA IFF 85” Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison “SMUS” IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr T J & Quong R W (1995) ANTLR A predicated-LL(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W Ward and S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (ed)
Springer-Verlag Berlin 1974 pp 9-16
[Wilhelm and Maurer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The specification for the 8-bit sampled voice file format uses C programming language
data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → “FORM” cksize “8SVX” ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → “VHDR” cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → “NAME” cksize CHAR[size]
copyrightck → “(c) ” cksize CHAR[size]
authorck → “AUTH” cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → “ANNO” cksize CHAR[size]
atakck → “ATAK” cksize EGpoint[size]
releaseck → “RLSE” cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → “BODY” cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest-frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest-frequency octave comes last, with the most samples:
$2^{\text{ctOctave}-1} \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^{0} + \cdots + 2^{\text{ctOctave}-1}) \times (\text{oneShotHiSamples} + \text{repeatHiSamples})$$
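The sample-count rule can be read as an attribute computed from the VHDR fields. The following is a minimal sketch in Python, assuming the VHDR layout of the 8SVX specification [Hayes and Morrison 1985]: three big-endian ULONGs, a UWORD samplesPerSec, two UBYTEs, and a 32-bit Fixed volume. The function names are illustrative only, and the result equals the BODY byte count only when the samples are stored uncompressed.

```python
import struct

# Assumed VHDR body layout (big-endian): ULONG x3, UWORD, UBYTE x2, Fixed (32-bit).
VHDR_FORMAT = ">IIIHBBi"  # 20 bytes in total


def parse_vhdr(body: bytes) -> dict:
    """Unpack the 20-byte body of a VHDR chunk into named fields."""
    if len(body) != struct.calcsize(VHDR_FORMAT):
        raise ValueError("VHDR body must be 20 bytes")
    names = ("oneShotHiSamples", "repeatHiSamples", "samplesPerHiCycle",
             "samplesPerSec", "ctOctave", "sCompression", "volume")
    return dict(zip(names, struct.unpack(VHDR_FORMAT, body)))


def expected_body_samples(vhdr: dict) -> int:
    """(2^0 + ... + 2^(ctOctave-1)) * (oneShotHiSamples + repeatHiSamples)."""
    highest_octave = vhdr["oneShotHiSamples"] + vhdr["repeatHiSamples"]
    # The geometric sum 2^0 + ... + 2^(n-1) equals 2^n - 1.
    return (2 ** vhdr["ctOctave"] - 1) * highest_octave
```

A validator could compare this value with the cksize of the BODY chunk when sCompression is 0.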
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The “length of header data” is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → “MThd” <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → “MTrk” <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
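The header and track-chunk productions can be checked with a few lines of code. The sketch below is one possible reading in Python, not the report's own checker. It assumes big-endian integers, a header length of exactly 6 as stated above, and two constraints taken from the Standard MIDI File specification rather than from the grammar: the format is 0, 1, or 2, and a format 0 file has one track. The function names are hypothetical.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Check the header chunk production against the first 14 bytes."""
    if len(data) < 14:
        raise ValueError("too short for a header chunk")
    tag, length, fmt, ntrks, division = struct.unpack(">4sIHHH", data[:14])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data must be 6")
    if fmt not in (0, 1, 2):
        raise ValueError("unknown format")
    if fmt == 0 and ntrks != 1:
        raise ValueError("a format 0 file has exactly one track")
    return {"format": fmt, "ntrks": ntrks, "division": division}


def validate_midi(data: bytes) -> dict:
    """Walk the chunks after the header and compare the MTrk count with ntrks."""
    header = parse_midi_header(data)
    pos, tracks = 14, 0
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        tag, length = struct.unpack(">4sI", data[pos:pos + 8])
        if pos + 8 + length > len(data):
            raise ValueError("chunk overruns the file")
        if tag == b"MTrk":
            tracks += 1
        pos += 8 + length  # MIDI chunks carry no pad byte
    if tracks != header["ntrks"]:
        raise ValueError("ntrks does not match the number of MTrk chunks")
    return header
```

The comparison of ntrks with the number of track chunks is the kind of context-sensitive condition that a context-free grammar alone cannot express.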
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. Apple [1991] also specified a compressed version, AIFF-C (AIFC), of
the Audio Interchange File Format. The following recasts the AIFF specification as a
context-free pseudo-EBNF grammar.
AudioAIFFfile -> “FORM” cksize “AIFF” localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> “COMM” cksize numChannels numSampleFrames sampleSize sampleRate
SoundDataChunk -> “SSND” cksize offset blockSize soundData[ ]
MarkerChunk -> “MARK” cksize Markers[ ]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> “INST” cksize baseNote detune lowNote highNote
    lowVelocity highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> “MIDI” cksize MIDIData[ ]
AudioRecordingChunk -> “AESD” cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> “APPL” cksize applicationSignature data[ ]
CommentsChunk -> “COMT” cksize numcomments Comments[ ]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these
chunks is optional.
nameprop -> “NAME” size CHAR[size]
copyrightck -> “(c) ” size CHAR[size]
authorck -> “AUTH” size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> “ANNO” size CHAR[size]
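A chunk walker for the FORM container is one way to check the structure that these productions describe. The sketch below, in Python, assumes the EA IFF 85 conventions: 4-byte chunk IDs, big-endian 32-bit sizes, and a pad byte after each odd-sized chunk. The function names are illustrative and are not taken from the AIFF specification. The second function mirrors the localchunks production but does not enforce chunk order.

```python
import struct


def iff_local_chunks(data: bytes):
    """Return (form_type, [(ckID, ckData), ...]) for one IFF FORM."""
    if len(data) < 12:
        raise ValueError("too short for a FORM header")
    form_id, cksize, form_type = struct.unpack(">4sI4s", data[:12])
    if form_id != b"FORM":
        raise ValueError("missing FORM tag")
    end = 8 + cksize  # cksize counts the form type and all local chunks
    if end > len(data):
        raise ValueError("FORM cksize exceeds the file length")
    chunks, pos = [], 12
    while pos < end:
        if pos + 8 > end:
            raise ValueError("truncated chunk header")
        ck_id, ck_size = struct.unpack(">4sI", data[pos:pos + 8])
        start = pos + 8
        if start + ck_size > end:
            raise ValueError("chunk overruns the FORM")
        chunks.append((ck_id, data[start:start + ck_size]))
        pos = start + ck_size + (ck_size & 1)  # skip the pad byte, if any
    return form_type, chunks


def has_required_aiff_chunks(data: bytes) -> bool:
    """True if the FORM is of type AIFF with one COMM and one SSND chunk."""
    form_type, chunks = iff_local_chunks(data)
    ids = [ck_id for ck_id, _ in chunks]
    return (form_type == b"AIFF"
            and ids.count(b"COMM") == 1
            and ids.count(b"SSND") == 1)
```

Optional chunks such as “NAME” or “ANNO” pass through the walker unchanged, so the same function can serve other FORM types.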
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as
a container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       “RIFF” 4-byte tag
4       size of the data chunk starting at offset 8
8       “WEBP” the form-type signature
12      “VP8 ” 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
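The offsets in this table translate directly into a header check. The following Python sketch assumes that both size fields are 32-bit little-endian integers, as in other RIFF files, and that the “VP8 ” tag ends with a space. The function name check_webp_header is hypothetical.

```python
import struct


def check_webp_header(data: bytes) -> int:
    """Validate the 20-byte header laid out in the table; return the VP8 data size."""
    if len(data) < 20:
        raise ValueError("too short for a WebP header")
    riff, riff_size, webp, vp8, vp8_size = struct.unpack("<4sI4s4sI", data[:20])
    if riff != b"RIFF" or webp != b"WEBP":
        raise ValueError("not a RIFF/WEBP file")
    if vp8 != b"VP8 ":
        raise ValueError("missing VP8 tag at offset 12")
    if 8 + riff_size > len(data):
        raise ValueError("RIFF size exceeds the file length")
    if 20 + vp8_size > 8 + riff_size:
        raise ValueError("VP8 chunk overruns the RIFF chunk")
    return vp8_size
```

The two size comparisons are attribute checks: each relates a value read from the file to the extent of the data that follows it.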
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
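The segment format above can be exercised with a short walker. This Python sketch follows the description literally: each segment starts with $ff, a type byte, and a big-endian size that includes the two size bytes. It assumes no fill bytes between segments, and it stops at the start-of-scan segment ($da) because the entropy-coded data that follows is not in segment form. The function name is illustrative.

```python
import struct


def jpeg_segments(data: bytes):
    """Return [(type, payload), ...] for the segments between SOI and SOS or EOI."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI header")
    segments, pos = [], 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        seg_type = data[pos + 1]
        if seg_type == 0xD9:  # EOI trailer
            break
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError("bad segment size")
        segments.append((seg_type, data[pos + 4:pos + 2 + size]))
        if seg_type == 0xDA:  # SOS: compressed image data follows
            break
        pos += 2 + size
    return segments
```

For a JFIF file, the first element returned should be an APP0 segment (type $e0) whose payload begins with the bytes "JFIF" and a zero byte.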
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
“Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself).” [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNGfile -> PNGSignature headerck paletteck
headerck -> cksize “IHDR” Width Height BitDepth ColourType
    CompressionMethod FilterMethod InterlaceMethod crc
paletteck -> cksize “PLTE” entries crc
entries -> entry entries
entries -> entry
entry -> red green blue
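The PNG specification lays out each chunk as a 4-byte length, a 4-byte type, the chunk data, and a 4-byte CRC computed over the type and data. The following Python sketch checks the signature and the IHDR chunk on that basis. It is an illustration under those assumptions rather than a full PNG validator; read_ihdr is a hypothetical name, and the field names follow the headerck production.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_ihdr(data: bytes) -> dict:
    """Check the PNG signature and the IHDR chunk; return the header fields."""
    if len(data) < 8 + 8 + 13 + 4:
        raise ValueError("too short for a signature and an IHDR chunk")
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("bad PNG signature")
    length, ctype = struct.unpack(">I4s", data[8:16])
    if ctype != b"IHDR" or length != 13:
        raise ValueError("first chunk must be a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    # The CRC covers the chunk type and the chunk data, not the length field.
    if zlib.crc32(ctype + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    names = ("Width", "Height", "BitDepth", "ColourType",
             "CompressionMethod", "FilterMethod", "InterlaceMethod")
    return dict(zip(names, struct.unpack(">IIBBBBB", body)))
```

The CRC comparison is another example of a condition on attribute values that lies outside what the context-free productions can state.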
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace Scene/Object (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
References
[Aho 1968] A V Aho Indexed Grammars – An Extension of Context-free Grammars Journal
of the ACM Vol 15 Issue 4 Oct 1968
[Alias-wavefront 1999] Maya File Formats Maya 20 1999
httpcaadarchethzchinfomayamanual
[Amigan 2011] Amigan Software IFF FORM and Chunk Registry
httpamigan1emunetregiffhtml
[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description
May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats
Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21
[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files
Version 13 Apple Computer Inc January 1989
wwwaesnashvilleorgPDFsAIFF20Specspdf
[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-
webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt
[Apple 2005] Apple Core Audio Format Specification 2005b
httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_
specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101
[Apple 2007] QuickTime File Format Specification
httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref
acehtml
[ASF 2001] ASF format version 10 reconstruction January 2001
httpavifilesourceforgenetasf-10htm
[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January
1997 wwwdcsedacukhomemxrgfx3d3DSspec
[Autodesk 2009] Autodesk DXF Reference January 2009
httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112
[Back 2002] G Back (2002) A specification and scripting language for binary data Generative
Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-
77
[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V
22 1979 pp 243-256
[Baum & Noel 2002] D Baum & G Noel The Sims Interchange File Format 2002
httpsimtechsourceforgenettechiffhtml
[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New
Horizons Software Inc 1987 httpamigan1emunetregWORDtxt
[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995
ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt
[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010
httpondasourceforgenetnlfnlfFormathtml
[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm & Onda file
formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml
[California Digital Library 2011] California Digital Library JHOVE2 User's Guide 2011
httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf
[Caligari 1998] Caligari, Caligari trueSpace file format Document Version 6 July
2002 httpwwwcaligaricomdownloadzipscncob65zip
[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-
t81pdf
[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description
Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf
[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998
[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt
[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D
Taliesin Inc httpamigan1emunetregDR2Dtxt
[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG
decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt
[Dunckley et al 2007] Dunckley M Rankin S Conway E & Giaretta D (2007) The use of
file description languages for file format identification and validation Proc PV 2007
Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-
detailsw=50089
[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format
Tech 3285 July 2001 httptechebuchdocstechtech3285pdf
[Faase] Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml
[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd
May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html
[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495
wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml
[Fisher & Gruber 2005] Fisher K & Gruber R (2005) PADS A domain specific language for
processing ad hoc data ACM Conference on Programming language Design and
Implementation ACM Press httpwwwpadsprojorgpaperspldipdf
[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG
1995
[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal
Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September
1997 wwwgnelsondemoncoukzspecquetzalhtml
[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called
Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)
httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt
[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03
httpwwwanim8orcomresourcesan8_formattxt
[Google 2010] WebP RIFF Container 2010
httpcodegooglecomspeedwebpdocsriff_containerhtml
[Haaland] Mike Haaland Flic Files (FLI) Format description
httpwwwmartinreddynetgfxanimFLItxt
[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube
Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice
Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt
[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata
Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1
1995
[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)
August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21
[IBM & Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data
Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt
wwwtactilemediacominfoMCI_Control_Infohtml
httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf
[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec
11 2008 httpduskblueorgprojtoymidimidiformatpdf
[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10
ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf
[ISO 2003] Information technology -- Computer graphics and image processing -- Portable
Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C
Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110
[Issar & Ward 1993] S Issar and W Ward CMU's robust spoken language understanding
system EUROSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf
[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software
November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application
Version 50) wwwjasccomsupportkbarticlespspspecasp
[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version
22 April 2002
httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf
[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994
httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127-145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95-96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWO2) November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4D® - File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology “Appendix H LXO file format” Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maischein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison “EA IFF 85” Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison “SMUS” IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM. October 2009.
httpwikimultimediacxindexphptitle=SANM
[Nolan] M. P. Nolan. An LL(1) parser generator implemented for GaTech CS 3240.
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E. Novikov and Valek Fillipov. Cdrloader.py: a CDR import filter
written in Python in the open source graphics file convertor UniConvertor.
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG. Data Format Description Language (DFDL) v1.0 Specification.
January 31, 2011. wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim. Standard MIDI Files 0.06. March 1, 1988.
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr & Quong 1995] Parr, T. J. & Quong, R. W. (1995). ANTLR: A predicated-LL(k) parser
generator. Software: Practice and Experience 25, 7 (July), 789-810.
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr, T. (2007). The Definitive ANTLR Reference: Building Domain-Specific
Languages. Pragmatic.
[Peters] Christoph Peters. The Ultimate 3D model file format (v 2.0).
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S. Pfeiffer. The Ogg Encapsulation Format Version 0. RFC 3533, May 2003.
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin. Blorb: An IF Resource Collection Format Standard, Version 2.0.
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[Randelshofer 2011] Werner Randelshofer. Class RIFFParser. January 6, 2011.
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed.). MNG (Multiple-image Network Graphics)
Format, Version 1.0. 2001. wwwlibpngorgpubmngspecCredits
[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel File
Format: Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003. OpenOffice.org, April 2008.
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki. IFF. httprewikiregengedankendewikiIFF
[RFC 3072] M. Wildgrube. Structured Data Exchange Format (SDXF). Request for Comments
3072, March 2001. wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw, Jerry Morrison, and Bob Burns. "FTXT" IFF Formatted Text. Draft
26, Electronic Arts, November 15, 1985. httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L. de Soras. Official formats of the Graoumf Tracker modules: GT2 v1 - v3 and
old GT modules GTK. 31/08/95. httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc. "ANIM": An IFF Format For CEL Animations. 4 May 1988.
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir. C# RIFF Parser. June 2005.
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan. Attribute Grammars and their Applications.
In: Encyclopedia of Information Science and Technology, Second Edition, Editor: Mehdi
Khosrow-Pour, Information Science Publishing, pp. 268-273, 2008.
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W. Underwood. Extensions of the UNIX file Command and Magic File for
File Type Identification. PERPOS TR ITTLCSITD 09-02, Georgia Tech Research Institute,
September 2009.
[Ung 1996] David Ung. Retargetable loader. Honours thesis, Computer Science and Electrical
Engineering, The University of Queensland, 1996.
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W. Ward and S. Issar. Recent improvements in the CMU spoken language
understanding system. ARPA Human Language Technologies Workshop, Plainsboro, New
Jersey. httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.
Automata, Languages and Programming, Lecture Notes in Computer Science 14, J. Loeckx (ed.),
Springer-Verlag, Berlin, 1974, pp. 9-16.
[Wilhelm and Maurer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995 (chapter 10, Attribute Grammars).
[Xiph 2010] Xiph.Org Foundation. Vorbis I specification. February 3, 2010.
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H. Zappe. Oktalyzer. 1994.
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A. Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
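As an illustration of how the voice8header rule maps onto bytes, the following Python sketch decodes a VHDR chunk. It assumes the layout of the 8SVX specification: big-endian fields, samplesPerSec (UWORD) following samplesPerHiCycle, and volume stored as a 32-bit Fixed value with 16 fractional bits. The function name parse_vhdr and the returned dictionary are hypothetical and are not part of the report.

```python
import struct

# Assumed layout: ckID, cksize, three ULONGs, one UWORD, two UBYTEs, one Fixed.
VHDR_LAYOUT = struct.Struct(">4sIIIIHBBI")
VHDR_DATA_SIZE = 20  # bytes of chunk data after ckID and cksize


def parse_vhdr(buf):
    """Hypothetical helper: decode a VHDR chunk located at the start of buf."""
    if len(buf) < VHDR_LAYOUT.size:
        raise ValueError("buffer too short for a VHDR chunk")
    (ckid, cksize, one_shot, repeat, per_cycle,
     rate, octaves, compression, volume) = VHDR_LAYOUT.unpack_from(buf)
    if ckid != b"VHDR":
        raise ValueError("not a VHDR chunk")
    if cksize != VHDR_DATA_SIZE:
        raise ValueError("unexpected VHDR size")
    return {
        "oneShotHiSamples": one_shot,
        "repeatHiSamples": repeat,
        "samplesPerHiCycle": per_cycle,
        "samplesPerSec": rate,
        "ctOctave": octaves,
        "sCompression": compression,
        # Fixed is treated here as 16.16 fixed point, so 0x10000 is 1.0.
        "volume": volume / 65536.0,
    }
```

A validator built this way checks the terminal "VHDR" and the size attribute before trusting any of the remaining fields, which mirrors the order of symbols in the grammar rule.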
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$
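The sample count described above can be checked with a short Python sketch. The function name body_sample_count is hypothetical. It sums the per-octave sizes directly, and the closed form (2^ctOctave - 1) × (oneShotHiSamples + repeatHiSamples) follows from the geometric series.

```python
def body_sample_count(ct_octave, one_shot_hi, repeat_hi):
    """Hypothetical helper: total samples in an uncompressed 8SVX BODY chunk."""
    highest = one_shot_hi + repeat_hi  # samples in the highest-frequency octave
    # Octave k (0 = highest frequency) holds 2**k times as many samples.
    return sum((2 ** k) * highest for k in range(ct_octave))


def body_sample_count_closed_form(ct_octave, one_shot_hi, repeat_hi):
    """Same quantity via the geometric-series identity."""
    return (2 ** ct_octave - 1) * (one_shot_hi + repeat_hi)
```

An attribute grammar could use this value as a semantic check, comparing it against the cksize of the BODY chunk when sCompression indicates no compression.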
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
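The header-chunk rule can be illustrated with a minimal Python sketch. It assumes big-endian integers, as in the Standard MIDI File specification, and the function name parse_mthd is hypothetical.

```python
import struct


def parse_mthd(buf):
    """Hypothetical helper: validate an MThd chunk and return its three fields."""
    ckid, length = struct.unpack_from(">4sI", buf, 0)
    if ckid != b"MThd":
        raise ValueError("not a MIDI header chunk")
    if length != 6:
        raise ValueError("length of header data must be 6")
    # Header data: three 16-bit integers (format, ntrks, division).
    fmt, ntrks, division = struct.unpack_from(">HHH", buf, 8)
    return fmt, ntrks, division
```

The check on the length field is the kind of constraint that a plain context-free rule cannot express but an attribute attached to <length of header data> can.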
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
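The chunk structure shared by these rules can be walked with a short Python sketch. It assumes the EA IFF 85 conventions: a 4-byte identifier, a big-endian 32-bit size, and a pad byte after any chunk whose size is odd. The function name read_form is hypothetical.

```python
import struct


def read_form(buf):
    """Hypothetical helper: return (form type, [(chunk id, data), ...])."""
    ckid, cksize, form_type = struct.unpack_from(">4sI4s", buf, 0)
    if ckid != b"FORM":
        raise ValueError("not an IFF FORM")
    end = 8 + cksize  # cksize counts everything after the size field
    pos = 12          # first local chunk follows the form type
    chunks = []
    while pos + 8 <= end:
        cid, size = struct.unpack_from(">4sI", buf, pos)
        chunks.append((cid, buf[pos + 8:pos + 8 + size]))
        # Chunks are aligned to even offsets; the pad byte is not in size.
        pos += 8 + size + (size & 1)
    return form_type, chunks
```

For an AIFF file the returned list would be expected to contain COMM and SSND entries, optionally accompanied by text chunks such as NAME or ANNO.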
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced in 2010 by Google, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag, describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
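The offsets in this layout translate directly into a small Python sketch. It assumes little-endian size fields, as is usual for RIFF, and a "VP8 " tag with a trailing space. The function name parse_webp_header is hypothetical.

```python
import struct


def parse_webp_header(buf):
    """Hypothetical helper: return (RIFF size, VP8 size, VP8 image data)."""
    riff, riff_size, form = struct.unpack_from("<4sI4s", buf, 0)
    if riff != b"RIFF" or form != b"WEBP":
        raise ValueError("not a RIFF/WEBP container")
    tag, vp8_size = struct.unpack_from("<4sI", buf, 12)
    if tag != b"VP8 ":
        raise ValueError("missing VP8 chunk tag at offset 12")
    # The image data starts at offset 20 and runs for vp8_size bytes.
    return riff_size, vp8_size, buf[20:20 + vp8_size]
```

A validator could additionally compare riff_size with vp8_size plus the 12 bytes of tags and size fields that precede the image data.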
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker; see below
- any number of "segments" (similar to IFF chunks); see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes, but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
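The segment format can be sketched in Python as a simple walker. This is a simplified illustration and not a full JPEG parser: it stops after the start-of-scan segment ($da), because the entropy-coded image data that follows is not organized into length-prefixed segments. The function name iter_jpeg_segments is hypothetical.

```python
import struct


def iter_jpeg_segments(buf):
    """Hypothetical helper: yield (segment type, payload) for header segments."""
    if buf[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 2 <= len(buf):
        if buf[pos] != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        seg_type = buf[pos + 1]
        if seg_type == 0xD9:  # EOI: trailer reached
            return
        # Size is big-endian and includes its own two bytes.
        (size,) = struct.unpack_from(">H", buf, pos + 2)
        yield seg_type, buf[pos + 4:pos + 2 + size]
        if seg_type == 0xDA:  # SOS: compressed data follows, stop here
            return
        pos += 2 + size
```

Because the size excludes the $ff and type bytes, the next segment begins 2 + size bytes after the current marker, which is the arithmetic the loop relies on.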
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entry -> red green blue
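To show how the header chunk looks on disk, the following Python sketch reads the signature and the IHDR chunk. Note that in the PNG specification the 4-byte length precedes the chunk type and a 4-byte CRC follows the chunk data, so the byte order differs slightly from the order of symbols in the headerck rule. The function name parse_ihdr and the dictionary keys are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_FIELDS = ("width", "height", "bit_depth", "colour_type",
               "compression_method", "filter_method", "interlace_method")


def parse_ihdr(buf):
    """Hypothetical helper: validate the signature and IHDR chunk of a PNG."""
    if buf[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    length, ctype = struct.unpack_from(">I4s", buf, 8)
    if ctype != b"IHDR" or length != 13:
        raise ValueError("first chunk must be a 13-byte IHDR")
    data = buf[16:29]
    (crc,) = struct.unpack_from(">I", buf, 29)
    # The CRC covers the chunk type and data, but not the length field.
    if zlib.crc32(ctype + data) != crc:
        raise ValueError("IHDR CRC mismatch")
    return dict(zip(IHDR_FIELDS, struct.unpack(">IIBBBBB", data)))
```

The CRC comparison is another example of a constraint that sits naturally in an attribute rather than in the context-free rules themselves.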
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro FLI
[Haaland]
A.11.3 Animator Pro FLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
29
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
30
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
31
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)
Springer-Verlag Berlin 1974 pp 9-16
32
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
33
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 41 of this report
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 9-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985] The following is a re-
expression of that specification as a context-free binary file grammar
Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata
cksize rarr ULONG
ckdata rarr voice8header bodyck
ckdata rarr voice8header nameck copyrightck bodyck
ckdata rarr voice8header copyrightck nameck bodyck
ckdata rarr voice8header nameck bodyck
ckdata rarr voice8header copyrightck bodyck
ckdata rarr voice8header authorck bodyck
ckdata rarr voice8header annotation bodyck
ckdata rarr voice8header atakck releaseck bodyck
34
voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples
samplesPerHiCycle ctOctave sCompression Fixedevolume
oneShotHiSamples rarr ULONG
repeatHiSamples rarr ULONG
samplesPerHiCycle rarr ULONG
samplesPerSec rarr UWORD
ctOctave rarr UBYTE
sCompression rarr UBYTE
Fixed volume rarr UBYTE
nameprop rarr ―NAME cksize CHAR[size]
copyrightck rarr (c) cksize CHAR[size]
authorck rarr ―AUTH cksize CHAR[size]
annotation rarr annotationck annotation
annotation rarr annotationck
annotationck rarr ―ANNO cksize CHAR[size]
atakckrarr ―ATAK cksize EGpoint[size]
releaseck rarr ―RLSE cksize EGpoint[size]
EGpoint rarr duration dest
duration rarr UWORD
dest rarr Fixed
bodyck rarr ―BODY cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{ctOctave-1} \cdot (oneShotHiSamples + repeatHiSamples)$. The number of samples in the
BODY chunk is
$(2^{0} + \cdots + 2^{ctOctave-1}) \cdot (oneShotHiSamples + repeatHiSamples)$
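The sample-count rule above can be checked with a short calculation. The following is a minimal Python sketch, not part of the 8SVX specification; the function names are hypothetical, and the arguments stand for the voice8header fields of the same names.

```python
def octave_sample_counts(ct_octave, one_shot_hi_samples, repeat_hi_samples):
    """Samples per octave, highest-frequency octave first (illustrative helper)."""
    highest = one_shot_hi_samples + repeat_hi_samples
    # Each successive (lower) octave holds twice as many samples as the previous one.
    return [highest * 2 ** k for k in range(ct_octave)]


def body_sample_count(ct_octave, one_shot_hi_samples, repeat_hi_samples):
    """Total samples in BODY: (2^0 + ... + 2^(ctOctave-1)) * (oneShot + repeat)."""
    return sum(octave_sample_counts(ct_octave, one_shot_hi_samples, repeat_hi_samples))
```

Because the bracketed sum is a geometric series, the total also equals (2^ctOctave - 1) times the size of the highest octave, which a validator could compare against the size declared for the BODY chunk.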
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks> | <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
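As an illustration of the header chunk production, the following Python sketch reads the "MThd" chunk and checks that its length is 6. It is a hedged example rather than a full validator: the function name is hypothetical, and it relies on the fact that Standard MIDI Files store multi-byte integers in big-endian order.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Parse <headerchunk>: "MThd", a 32-bit length, then three 16-bit integers."""
    if len(data) < 14:
        raise ValueError("file too short for a MIDI header chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError(f"unexpected header length {length}")
    # <header data> -> <format> <ntrks> <division>
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}
```

Track chunks could be read the same way, using the "MTrk" tag and the declared length to find the start of the next chunk.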
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote
    lowVelocity highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numComments Comments[ ]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
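The text-chunk productions above can be exercised with a small chunk walker. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the EA IFF 85 conventions that chunk sizes are 32-bit big-endian integers and that a chunk with an odd size is followed by one pad byte. Decoding the text as Latin-1 is also an assumption made for the example.

```python
import struct

TEXT_CHUNK_IDS = {b"NAME", b"AUTH", b"(c) ", b"ANNO"}


def iter_iff_chunks(form: bytes):
    """Yield (chunk id, chunk data) for the local chunks of one IFF FORM."""
    if len(form) < 12 or form[:4] != b"FORM":
        raise ValueError("not an IFF FORM")
    (form_size,) = struct.unpack(">I", form[4:8])
    end = min(8 + form_size, len(form))
    pos = 12  # skip "FORM", the size field, and the 4-byte form type
    while pos + 8 <= end:
        ck_id, ck_size = struct.unpack(">4sI", form[pos:pos + 8])
        start = pos + 8
        yield ck_id, form[start:start + ck_size]
        pos = start + ck_size + (ck_size & 1)  # odd-sized chunks carry a pad byte


def text_chunks(form: bytes):
    """Collect the optional name, author, copyright, and annotation chunks."""
    return [(ck_id.decode("ascii"), data.decode("latin-1"))
            for ck_id, data in iter_iff_chunks(form)
            if ck_id in TEXT_CHUNK_IDS]
```

Since the four text chunks are optional, an empty result is valid; the walker only reports the ones it finds.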
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
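The offset table can be turned into a simple layout check. The following Python sketch is a hedged illustration covering only the simple layout listed above; the function name is hypothetical, and it relies on the fact that RIFF size fields are little-endian 32-bit integers.

```python
import struct


def parse_webp_header(data: bytes) -> dict:
    """Check the RIFF/WEBP/VP8 tags at offsets 0, 8, and 12 and read both sizes."""
    if len(data) < 20:
        raise ValueError("file too short for a WebP header")
    riff, riff_size, webp, vp8, vp8_size = struct.unpack("<4sI4s4sI", data[:20])
    if riff != b"RIFF" or webp != b"WEBP":
        raise ValueError("not a RIFF/WEBP container")
    if vp8 != b"VP8 ":
        raise ValueError("expected a 'VP8 ' chunk at offset 12")
    return {
        "riff_size": riff_size,              # size of data starting at offset 8
        "vp8_size": vp8_size,                # size of image data starting at offset 20
        "image_data": data[20:20 + vp8_size],
    }
```

A validator might additionally compare riff_size with the file length minus 8, which is what the size at offset 4 is described as covering.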
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker, see below
- any number of segments (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff      identifies segment
    n        type of segment (one byte)
    sh, sl   size of the segment, including these two bytes, but not
             including the $ff and the type byte. Note: not Intel order;
             high byte first, low byte last
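The segment format described above can be sketched as a short Python generator. This is an illustrative example, not a complete JPEG parser: the names are hypothetical, it stops at the start-of-scan segment ($da) because compressed image data follows it, and it does not handle markers that carry no size field.

```python
import struct

SOI, EOI, SOS = 0xD8, 0xD9, 0xDA


def iter_jpeg_segments(data: bytes):
    """Yield (segment type, payload) pairs from the start of a JPEG/JFIF stream."""
    if data[:2] != bytes([0xFF, SOI]):
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type == EOI:
            return
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # sh, sl: high byte first; the size counts itself but not $ff and the type byte
        (size,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError("bad segment size")
        yield seg_type, data[pos + 4:pos + 2 + size]
        if seg_type == SOS:
            return  # entropy-coded data follows; not walked in this sketch
        pos += 2 + size
```

For a JFIF file, the first pair yielded would be expected to have type $e0 (APP0).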
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
    Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entries -> entry
entry -> red green blue
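As an illustration of the signature and header chunk productions, the following Python sketch reads the PNG signature and the IHDR fields. It is a hedged example with hypothetical names. Note that in the PNG specification each chunk stores its 4-byte length before the 4-byte chunk type and ends with a CRC computed over the type and data, so the sketch reads the fields in that order.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_FIELDS = ("width", "height", "bit_depth", "colour_type",
               "compression_method", "filter_method", "interlace_method")


def parse_png_header(data: bytes) -> dict:
    """Validate PNGSignature and decode the 13-byte IHDR chunk that follows it."""
    if len(data) < 33 or data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature or truncated header")
    length, ck_type = struct.unpack(">I4s", data[8:16])
    if ck_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    if zlib.crc32(ck_type + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    return dict(zip(IHDR_FIELDS, struct.unpack(">IIBBBBB", body)))
```

The palette chunk could be handled the same way, with the chunk length divided by 3 giving the number of red, green, blue entries.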
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format - 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 - X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
33
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 41 of this report
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 9-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985] The following is a re-
expression of that specification as a context-free binary file grammar
Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata
cksize rarr ULONG
ckdata rarr voice8header bodyck
ckdata rarr voice8header nameck copyrightck bodyck
ckdata rarr voice8header copyrightck nameck bodyck
ckdata rarr voice8header nameck bodyck
ckdata rarr voice8header copyrightck bodyck
ckdata rarr voice8header authorck bodyck
ckdata rarr voice8header annotation bodyck
ckdata rarr voice8header atakck releaseck bodyck
34
voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples
samplesPerHiCycle ctOctave sCompression Fixedevolume
oneShotHiSamples rarr ULONG
repeatHiSamples rarr ULONG
samplesPerHiCycle rarr ULONG
samplesPerSec rarr UWORD
ctOctave rarr UBYTE
sCompression rarr UBYTE
Fixed volume rarr UBYTE
nameprop rarr ―NAME cksize CHAR[size]
copyrightck rarr (c) cksize CHAR[size]
authorck rarr ―AUTH cksize CHAR[size]
annotation rarr annotationck annotation
annotation rarr annotationck
annotationck rarr ―ANNO cksize CHAR[size]
atakckrarr ―ATAK cksize EGpoint[size]
releaseck rarr ―RLSE cksize EGpoint[size]
EGpoint rarr duration dest
duration rarr UWORD
dest rarr Fixed
bodyck rarr ―BODY cksize samples[size]
In general the BODY has ctOctave octaves of data The highest frequency octave comes first
comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles The
lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +
repeatHiSamples) The number of samples in the BODY chunk is
(20 + + 2
(ctOctave-1) (oneShotHiSamples + repeatHiSamples)
A13 Cel Animation (ANIM)
[Sparta 1988]
35
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below
[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit
integers
Midi rarr ltheaderchunkgt lttrackchunksgt
ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt
ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt
lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt
lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt
lttrack datagt rarr ltMTrk eventgt+
ltMTrk eventgt rarr ltdelta-timegt lteventgt
lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt
ltMIDI eventgt rarr ltMIDI channel messagegt
ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt
ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt
ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt
A3 Audio Interchange File Format
The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on
Electronic Artslsquo Interchange File Format The file format is specified using C data structure
definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF
grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file
AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks
36
localhunks -gt CommonChunk SoundDataChunk
CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize
sampleRate
SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]
MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]
Markers -gt marker markers
Marker -gt markerid position markerName
InstrumentChunk -gt ldquoINSTrdquo cksize baseNote
detune
lowNote
highNote
` lowVelocity
` ` highVelocity
gain
sustainLoop
releaseLoop
MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]
AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]
ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]
CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]
Comments -gt Comment Comments
Comment -gt timeStamp
marker
count
text
The name author copyright and annotation chunks are included in the definition of every EA
IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is
optional
nameprop -gt ldquoNAMErdquo size CHAR[size]
copyrightck -gt ldquo(c) ldquo size CHAR[size]
authorck -gt ldquoAUTHrdquo size CHAR[size]
annotation -gt annotationck annotation
annotation -gt annotationck
37
annotationck -gt ldquoANNOrdquo size CHAR[size]
A4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also
based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-
Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a
container
A41 RIFF-Waveform (WAVE)
[IBM amp Microsoft 1991]
This file format was discussed in section 42 of this report
A42 WAVEFORMATEX
[Microsoft 1994]
A43 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A44 Broadcast Wave Format
[EBU 2001]
A45 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A46 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A47 RIFF MIDIfile (RMID)
[IBM amp Microsoft 1991]
A48 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft
1991]
A49 WebP
WebP is a method of lossy compression for digital images The degree of compression is
adjustable so a user can choose the trade-off between file size and image quality WebP typically
38
achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image
quality A WebP file consists of VP8 image data and a container based on RIFF The format of a
WebP file is shown below [Google 2010]
0ffset Description
0 RIFF 4-byte tag
4 size of data chunk sarting at offset 8
8 WEBP the form-type signature
12 VP8 4-bytes tag describing the raw video format used
16 size of the raw VP8 image data chunk starting at offset 20
20 the VP8 image data
A5 JPEG
A51 JPEGJFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEGJFIF file format
- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)
- for JFIF files an APP0 segment is immediately following the SOI marker see below
- any number of segments (similar to IFF chunks) see below
- trailer (2 bytes) $ff $d9 (EOI)
Segment format
- header (4 bytes)
$ff identifies segment
n type of segment (one byte)
sh sl size of the segment including these two bytes but not
including the $ff and the type byte Note not Intel order
high byte first low byte last
A52 JPEGExif
[JEIDA 2002]
39
A53 JPEG 2000
―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these
markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC
10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a
marker and associated parameters called marker parameters In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself) [ISO 2000 page 13]
A54 JPEG 2000 Codestream
[ISO 2000] Annex A Codestream syntax
A6 Advanced Systems Format
Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV)
A61 Windows Media Audio 7-9
Microsoft 2004]
A62 Windows Media Video 7-91
[Microsoft 2004]
A7 Portable Network Graphics
[ISO 159482003]
PNG file -gt PNGSignature headerck paletteck
headerck -gt IHDR cksize Width Height BitDepth ColourType
Compressionmethod Filtermethod Interlacemethod
paletteck -gt PLTE cksize entries
entries -gt entry entries
entry -gt red green blue
40
A71 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A72 JPEG Network Graphics
[Randers-Pehrson 2001]
A8 Binary Interchange File Format
[Rentz 2008]
A9 3D Studio
[Hamilton 1992] Eric Hamilton. JPEG File Interchange Format, Version 1.02. C-Cube
Microsystems, Milpitas, CA, September 1, 1992. www.jpeg.org/public/jfif.pdf
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison. "8SVX" IFF 8-Bit Sampled Voice.
Electronic Arts, February 7, 1985. http://amigan.1emu.net/reg/8SVX.txt
[Hopcroft et al 1995] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata
Theory, Languages, and Computation, Second Edition. Addison Wesley, ISBN 0-201-44124-1,
1995.
[Houghtaling 1996] R. James Houghtaling. ANI (Windows95 Animated Cursor File Format).
August 1996. www.wotsit.org/list.asp?search=ani&button=GO%21
[IBM & Microsoft 1991] IBM and Microsoft. Multimedia Programming Interface and Data
Specifications 1.0. August 1991. www.kk.iij4u.or.jp/~kondo/wave/mpidata.txt
www.tactilemedia.com/info/MCI_Control_Info.html
http://www-mmsp.ece.mcgill.ca/documents/audioformats/wave/Docs/riffmci.pdf
[Int MIDI Assoc 2008] The International MIDI Association. Standard MIDI-File Format Spec.
1.1. 2008. http://duskblue.org/proj/toymidi/midiformat.pdf
[ISO 2000] JPEG 2000 Part I: Core Coding System, Final Committee Draft Version 1.0.
ISO/IEC JTC 1/SC 29/WG 1, March 2000. http://www.jpeg.org/public/fcd15444-1.pdf
[ISO 2003] Information technology - Computer graphics and image processing - Portable
Network Graphics (PNG): Functional specification. ISO/IEC 15948:2003 (E). W3C
Recommendation, 10 November 2003. www.w3.org/TR/2003/REC-PNG-20031110
[Issar & Ward 1993] S. Issar and W. Ward. CMU's robust spoken language understanding
system. EUROSPEECH-93, pp. 2147-2150. www.cs.cmu.edu/~air/papers/issar_ward.pdf
[Jasc 1998] Jasc Software. Paint Shop Pro (PSP) File Format Specification. Jasc Software,
November 1998 (Paint Shop Pro (PSP) File Format Version 3.0, Paint Shop Pro Application
Version 5.0). www.jasc.com/support/kb/articles/pspspec.asp
[JEIDA 2002] Japan Electronic Industry Development Association. Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still Camera: Exif), Version
2.2. April 2002.
http://www.kodak.com/global/plugins/acrobat/en/service/digCam/exifStandard2.pdf
[Kirvan 1994] S. Kirvan. Imagine 3.0 Object Format. Impulse Inc., May 1994.
http://www.textfiles.com/programming/FORMATS/t3d_doc1.txt
[Knuth 1969] D. E. Knuth. Semantics of Context-Free Grammars. Mathematical Systems Theory,
vol. 2, pp. 127-145, 1968.
[Knuth 1971] D. E. Knuth. Semantics of Context-Free Grammars (corrections). Mathematical
Systems Theory, vol. 5, pp. 95-96, 1971.
[Koster 1971] C. H. A. Koster. Affix grammars. In ALGOL 68 Implementation, J. E. L. Peck (ed.),
North-Holland Publ. Co., Amsterdam, 1971, pp. 95-109.
[Lemone 1993] Karen Lemone. Design of Compilers: Techniques of Programming Language
Translation. CRC Press, 1993. http://web.cs.wpi.edu/~kal/courses/compilers/project/index.html
http://web.cs.wpi.edu/~kal/courses/compilers/module4/mysa.html
[LightWave 1996] LightWave 3D Object File Format. Oct. 16, 1996.
http://sandbox.de/osg/lightwave.htm
[Lightwave 2001a] Lightwave. Lightwave 3D Object Files (LWO2). November 9, 2001.
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch. CINEMA 4D® - File Format Description. MAXON Computer,
1997. http://paulbourke.net/dataformats/cinema4d/cinema4d.pdf
[Luxology 2006] Luxology. "Appendix H: LXO file format." Extensions, modo 202 Scripting and
Commands, 2006. www.kxcad.net/modo/Scripting_and_Commands.pdf
[Maischein] Max Maischein. PIC(PRO) Format.
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin. SVAF format description, Version 0.2. January 2007.
http://svaf.sourceforge.net/svaf.html
[Microsoft 1992] Microsoft. Multimedia Standards Update: New Multimedia Data Types and
Data Techniques. July 29, 1992, Revision 1.0.97.
http://netghost.narod.ru/gff/vendspec/micriff/ms_riff.txt
[Microsoft 1994] Microsoft Corporation. Microsoft Multimedia Standards Update: New
Multimedia Data Types and Data Techniques. Revision 3.0, April 15, 1994.
[Microsoft 2003b] Microsoft Corporation. WAVEFORMATEXTENSIBLE.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/stream/hh/stream/audprop_7wz7.asp
[Microsoft 2004] Microsoft. Advanced Systems Format (ASF) Specification, Revision 01.20.03.
December 2004. www.microsoft.com/windows/windowsmedia/forpros/format/asfspec.aspx
[Microsoft 2005] Microsoft. AVI RIFF File Reference. 2005.
http://msdn.microsoft.com/en-us/library/ms899421.aspx
[Morrison 1985] Jerry Morrison. "EA IFF 85" Standard for Interchange Format Files.
Electronic Arts, January 14, 1985. www.martinreddy.net/gfx/2d/IFF.txt
[Morrison 1986a] Jerry Morrison. ILBM IFF Interleaved Bitmap. Electronic Arts, January 17,
1986. www.fine-view.com/jp/labs/doc/ilbm.txt
[Morrison 1986b] J. Morrison. "SMUS" IFF Simple Musical Score. Electronic Arts Inc.,
February 5, 1986. httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format. August 2006.
http://wiki.multimedia.cx/index.php?title=SAN#Chunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV. August 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ. November 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM. January 2008.
http://wiki.multimedia.cx/index.php?title=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI. February 2009.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM. October 2009.
http://wiki.multimedia.cx/index.php?title=SANM
[Nolan] M. P. Nolan. An LL(1) parser generator implemented for GaTech CS 3240.
http://code.google.com/p/ll-parser-generator/source/browse/trunk/Generator/parser3240/Main.java?r=42
[Novikov and Fillipov] Igor E. Novikov and Valek Fillipov. Cdrloader.py: A cdr import filter
written in Python in the open source graphics file convertor UniConvertor.
http://sk1project.org/modules.php?name=Products&product=cdrexplorer
[OGF 2011] OGF DFDL WG. Data Format Description Language (DFDL) v1.0 Specification.
January 31, 2011. www.ogf.org/documents/GFD.174.pdf
[Oppenheim 1988] Dave Oppenheim. Standard MIDI Files 0.06. March 1, 1988.
www.ibiblio.org/emusic-l/info-docs-FAQs/MIDI-doc/MIDI-SMF.txt
[Parr & Quong 1995] T. J. Parr and R. W. Quong. ANTLR: A predicated-LL(k) parser
generator. Software Practice and Experience, 25(7), July 1995, pp. 789-810.
http://www.antlr.org/article/1055550346383/antlr.pdf
[Parr 2007] T. Parr. The Definitive ANTLR Reference: Building Domain-Specific
Languages. Pragmatic, 2007.
[Peters] Christoph Peters. The Ultimate 3D model file format (v. 2.0).
http://ultimate3d.org/U3DModelFileFormatV2.htm
[Pfeiffer 2003] S. Pfeiffer. The Ogg Encapsulation Format Version 0. RFC 3533, May 2003.
http://xiph.org/ogg/doc/rfc3533.txt
[Plotkin] Andrew Plotkin. Blorb: An IF Resource Collection Format Standard, Version 2.0.
www.eblong.com/zarf/blorb
http://www.ifarchive.org/if-archive/infocom/interpreters/specification/blorb_format.txt
[Randelshofer 2011] Werner Randelshofer. Class RIFFParser. January 6, 2011.
www.randelshofer.ch/multishow/javadoc/ch/randelshofer/media/riff/RIFFParser.html
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed.). MNG (Multiple-image Network Graphics)
Format, Version 1.0. 2001. www.libpng.org/pub/mng/spec/#Credits
[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel File
Format: Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003. OpenOffice.org, April 2008.
http://sc.openoffice.org/excelfileformat.pdf
[REWiki 2009] reWiki. IFF. http://rewiki.regengedanken.de/wiki/IFF
[RFC 3072] M. Wildgrube. Structured Data Exchange Format (SDXF). Request for Comments
3072, March 2001. www.ietf.org/rfc/rfc3072.txt
[Shaw et al 1985] Steve Shaw, Jerry Morrison, and Bob Burns. "FTXT" IFF Formatted Text, Draft
2.6. Electronic Arts, November 15, 1985. httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L. de Soras. Official formats of the Graoumf Tracker modules: GT2 v1 - v3 and
old GT modules (GTK). 31/08/95. http://www.wotsit.org/list.asp?search=GTK&button=GO%21
[Sparta 1988] Sparta Inc. "ANIM" An IFF Format For CEL Animations. 4 May 1988.
http://amigan.1emu.net/reg/ANIM.txt
[Tamir 2005] Giora Tamir. C# RIFF Parser. June 2005.
www.codeproject.com/KB/files/riffparser.aspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan. Attribute Grammars and their Applications.
In Encyclopedia of Information Science and Technology, Second Edition, Mehdi
Khosrow-Pour (ed.), Information Science Publishing, pp. 268-273, 2008.
www.cs.wright.edu/~tkprasad/papers/Attribute-Grammars.pdf
[Underwood 2009] W. Underwood. Extensions of the UNIX file Command and Magic File for
File Type Identification. PERPOS TR ITTL/CSITD 09-02, Georgia Tech Research Institute,
September 2009.
[Ung 1996] David Ung. Retargetable loader. Honours thesis, Computer Science and Electrical
Engineering, The University of Queensland, 1996.
http://itee.uq.edu.au/~cristina/students/david/david.html
[Ward & Issar 1994] W. Ward and S. Issar. Recent improvements in the CMU spoken language
understanding system. ARPA Human Language Technologies Workshop, Plainsboro, New
Jersey. http://acl.ldc.upenn.edu/H/H94/H94-1039.pdf
[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.
In Automata, Languages and Programming, Lecture Notes in Computer Science 14, J. Loeckx (ed.),
Springer-Verlag, Berlin, 1974, pp. 9-16.
[Wilhelm and Mauer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995 (chapter 10, Attribute Grammars).
[Xiph 2010] Xiph.Org Foundation. Vorbis I specification. February 3, 2010.
http://xiph.org/vorbis/doc/Vorbis_I_spec.html
[Zappe 1994] H. Zappe. Oktalyzer. 1994.
http://www.wotsit.org/list.asp?search=okt&button=GO%21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986a]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
               samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathrm{ctOctave}-1} \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^{0} + \cdots + 2^{\mathrm{ctOctave}-1}) \times (\mathrm{oneShotHiSamples} + \mathrm{repeatHiSamples})$$
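The BODY sample count is a geometric sum over the octaves, so a validator could compare it with the size field of the BODY chunk. The following Python sketch is illustrative only; the function name body_sample_count is an assumption and does not come from the 8SVX specification.

```python
def body_sample_count(ct_octave: int, one_shot_hi: int, repeat_hi: int) -> int:
    """Total number of samples in a BODY chunk that holds ct_octave octaves."""
    if ct_octave < 1:
        raise ValueError("ctOctave must be at least 1")
    hi = one_shot_hi + repeat_hi  # samples in the highest-frequency octave
    # Octave k (k = 0 is the highest octave) holds 2**k * hi samples.
    return sum((1 << k) * hi for k in range(ct_octave))


# The sum collapses to (2**ctOctave - 1) * (oneShotHiSamples + repeatHiSamples).
assert body_sample_count(3, 100, 28) == (2**3 - 1) * 128
```

Because each 8SVX sample is one byte when no compression is applied, this count could be checked against the BODY chunk size for uncompressed files.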
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks> | <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
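As an illustration of the header chunk production, the following Python sketch reads the "MThd" tag, the length field, and the three 16-bit header values. It assumes the big-endian byte order used by Standard MIDI Files; the function name parse_midi_header and the returned dictionary keys are hypothetical.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Parse the header chunk at the start of a Standard MIDI File."""
    if len(data) < 14:
        raise ValueError("too short to hold a header chunk")
    tag, length = struct.unpack(">4sI", data[:8])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:  # the text above fixes the length of header data at 6
        raise ValueError("unexpected header length %d" % length)
    # Header data: three 16-bit big-endian integers.
    fmt, ntrks, division = struct.unpack(">HHH", data[8:14])
    return {"format": fmt, "ntrks": ntrks, "division": division}


example = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
header = parse_midi_header(example)
```

A fuller checker would go on to read ntrks track chunks, each introduced by "MTrk" and a 32-bit length.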
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote
                   lowVelocity highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every "EA
IFF 85" file. All are text chunks; their data portion consists solely of text. Each of these chunks
is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
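The text chunk productions above share one shape: a four-character identifier, a size, and size bytes of text. The following Python sketch walks a run of such chunks. It assumes the EA IFF 85 conventions of a 32-bit big-endian size and a pad byte after odd-sized chunk data; the names read_text_chunks and TEXT_CHUNK_IDS are hypothetical, and decoding the text as Latin-1 is a choice made for the example.

```python
import struct

TEXT_CHUNK_IDS = {b"NAME", b"(c) ", b"AUTH", b"ANNO"}


def read_text_chunks(body: bytes) -> list:
    """Collect (id, text) pairs from a run of IFF chunks."""
    found = []
    pos = 0
    while pos + 8 <= len(body):
        ck_id, size = struct.unpack(">4sI", body[pos:pos + 8])
        start = pos + 8
        end = start + size
        if end > len(body):
            raise ValueError("chunk overruns its container")
        if ck_id in TEXT_CHUNK_IDS:
            text = body[start:end].decode("latin-1")
            found.append((ck_id.decode("ascii"), text))
        pos = end + (size & 1)  # skip the pad byte after odd-sized data
    return found


sample = (b"NAME" + struct.pack(">I", 3) + b"abc" + b"\x00"
          + b"AUTH" + struct.pack(">I", 2) + b"EA")
chunks = read_text_chunks(sample)
```

Chunks with other identifiers are skipped by size, which is what lets a reader ignore chunk types it does not understand.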
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
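The fixed offsets in this layout can be checked directly. The following Python sketch validates the first 20 bytes and returns the size of the VP8 payload. It assumes the little-endian size fields that RIFF uses and covers only the simple layout tabulated above; the function name check_webp_header is hypothetical.

```python
import struct


def check_webp_header(data: bytes) -> int:
    """Validate the fixed fields of the layout above; return the VP8 data size."""
    if len(data) < 20:
        raise ValueError("too short for a WebP header")
    riff, riff_size, form, fourcc, vp8_size = struct.unpack(
        "<4sI4s4sI", data[:20])
    if riff != b"RIFF" or form != b"WEBP":
        raise ValueError("not a RIFF/WEBP container")
    if fourcc != b"VP8 ":
        raise ValueError("expected a 'VP8 ' chunk at offset 12")
    if 8 + riff_size > len(data) or 20 + vp8_size > len(data):
        raise ValueError("declared sizes exceed the file length")
    return vp8_size


payload = b"\x01\x02\x03\x04"
webp = (b"RIFF" + struct.pack("<I", 12 + len(payload)) + b"WEBP"
        + b"VP8 " + struct.pack("<I", len(payload)) + payload)
vp8_bytes = check_webp_header(webp)
```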
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes, but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
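The segment format above is enough to step through the header segments of a file. The following Python sketch lists the type and size of each segment after the SOI marker. It stops at the first SOS ($da) or EOI ($d9) marker, on the assumption that entropy-coded image data follows SOS and cannot be skipped by size alone; the function name list_jpeg_segments is hypothetical.

```python
import struct


def list_jpeg_segments(data: bytes) -> list:
    """Return (type, size) for each segment between SOI and the first SOS or EOI."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        # $ff, type byte, then the size with the high byte first.
        marker, seg_type, size = struct.unpack(">BBH", data[pos:pos + 4])
        if marker != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        if seg_type in (0xD9, 0xDA):  # EOI, or SOS followed by image data
            break
        segments.append((seg_type, size))
        pos += 2 + size  # size counts its own two bytes but not $ff and type
    return segments


jpeg = b"\xff\xd8" + b"\xff\xe0" + struct.pack(">H", 4) + b"JF" + b"\xff\xd9"
found_segments = list_jpeg_segments(jpeg)
```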
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000], Annex A: Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-9.1
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entries -> entry
entry -> red green blue
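In the PNG specification each chunk is stored as a 4-byte big-endian length, a 4-byte chunk type, the chunk data, and a 4-byte CRC computed over the type and data. The following Python sketch checks the signature and reads the IHDR fields named in the grammar. It is a minimal illustration rather than a complete validator; the function name read_png_header and the dictionary keys are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Check the PNG signature and decode the IHDR chunk that follows it."""
    if len(data) < 33:
        raise ValueError("too short to hold a signature and IHDR chunk")
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("bad PNG signature")
    length, ck_type = struct.unpack(">I4s", data[8:16])
    if ck_type != b"IHDR" or length != 13:
        raise ValueError("first chunk must be a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack(">I", data[29:33])
    if zlib.crc32(ck_type + body) != crc:  # CRC covers chunk type and data
        raise ValueError("IHDR CRC mismatch")
    names = ("width", "height", "bit_depth", "colour_type",
             "compression_method", "filter_method", "interlace_method")
    return dict(zip(names, struct.unpack(">IIBBBBB", body)))


ihdr_body = struct.pack(">IIBBBBB", 2, 3, 8, 3, 0, 0, 0)
png = (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + ihdr_body
       + struct.pack(">I", zlib.crc32(b"IHDR" + ihdr_body)))
ihdr = read_png_header(png)
```

A palette chunk could be read the same way: its length divided by three gives the number of red, green, blue entries.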
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
28
[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory
vol 2 pp 127_145 1968
[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical
Systems Theory vol 5 pp 95_96 1971
[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)
North-Holland Publ Co Amsterdam 1971 pp 95-109
[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language
Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml
httpwebcswpiedu~kalcoursescompilersmodule4mysahtml
[LightWave 1996] LightWave 3D Object File Format Oct 16 1996
httpsandboxdeosglightwavehtm
[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001
httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml
[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer
1997 httppaulbourkenetdataformatscinema4dcinema4dpdf
[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and
Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf
[Maishein] Max Maischein PIC(PRO) Format
http14791177212extrafileformatgraphicspicpicprotxt
[Martin 2007] Michael Martin SVAF format description Version 02 January 2007
httpsvafsourceforgenetsvafhtml
[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and
Data Techniques July 29 1992 Revision 1097
httpnetghostnarodrugffvendspecmicriffms_rifftxt
[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New
Multimedia Data Types and Data Techniques Revision 30 April 15 1994
[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE
httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-
usstreamhhstreamaudprop_7wz7asp
[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003
December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx
29
[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-
uslibraryms899421aspx
[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files
Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt
[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17
1986 wwwfine-viewcomjplabsdocilbmtxt
[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc
February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus
[MultimediaWiki 2006] Smush Animation Format August 2006
httpwikimultimediacxindexphptitle=SANChunk_Forma
[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6 April 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV August 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ November 2008
httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM January 2008
httpwikimultimediacxindexphptitle=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI February 2009
httpwikimultimediacxindexphptitle=Electronic_Arts_TQI
30
[MultimediaWiki 2009b] SANM October 2009
httpwikimultimediacxindexphptitle=SANM
[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240
httpcodegooglecompll-parser-
generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42
[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter
written in Python in the open source graphics file convertor UniConvertor
httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer
[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification
January 31 2011 wwwogforgdocumentsGFD174pdf
[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988
wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt
[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser
generator Software Practice and Experience 25 7 (July) 789-810
httpwwwantlrorgarticle1055550346383antlrpdf
[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific
Languages Pragmatic
[Peters] Christoph Peters The Ultimate 3D model file format (v 20)
httpultimate3dorgU3DModelFileFormatV2htm
[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003
httpxiphorgoggdocrfc3533txt
[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20
wwweblongcomzarfblorb
httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt
[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011
wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)
Format Version 10 2001 wwwlibpngorgpubmngspecCredits
31
[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File
Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF
[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments
3072 March 2001 wwwietforgrfcrfc3072txt
[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft
26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir C RIFF Parser June 2005
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications
In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi
Khosrow-Pour Information Science Publishing pp 268-273 2008
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for
File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute
September 2009
[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical
Engineering The University of Queensland 1996
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language
understanding system ARPA Human Language Technologies Workshop Plainsboro New
Jersey httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars
Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)
Springer-Verlag Berlin 1974 pp 9-16
32
[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley
1995 (chapter 10) (Attribute Grammars)
[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H Zappe Oktalyzer 1994
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
33
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 41 of this report
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 9-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985] The following is a re-
expression of that specification as a context-free binary file grammar
Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata
cksize rarr ULONG
ckdata rarr voice8header bodyck
ckdata rarr voice8header nameck copyrightck bodyck
ckdata rarr voice8header copyrightck nameck bodyck
ckdata rarr voice8header nameck bodyck
ckdata rarr voice8header copyrightck bodyck
ckdata rarr voice8header authorck bodyck
ckdata rarr voice8header annotation bodyck
ckdata rarr voice8header atakck releaseck bodyck
34
voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples
samplesPerHiCycle ctOctave sCompression Fixedevolume
oneShotHiSamples rarr ULONG
repeatHiSamples rarr ULONG
samplesPerHiCycle rarr ULONG
samplesPerSec rarr UWORD
ctOctave rarr UBYTE
sCompression rarr UBYTE
Fixed volume rarr UBYTE
nameprop rarr ―NAME cksize CHAR[size]
copyrightck rarr (c) cksize CHAR[size]
authorck rarr ―AUTH cksize CHAR[size]
annotation rarr annotationck annotation
annotation rarr annotationck
annotationck rarr ―ANNO cksize CHAR[size]
atakckrarr ―ATAK cksize EGpoint[size]
releaseck rarr ―RLSE cksize EGpoint[size]
EGpoint rarr duration dest
duration rarr UWORD
dest rarr Fixed
bodyck rarr ―BODY cksize samples[size]
In general the BODY has ctOctave octaves of data The highest frequency octave comes first
comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles The
lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +
repeatHiSamples) The number of samples in the BODY chunk is
(20 + + 2
(ctOctave-1) (oneShotHiSamples + repeatHiSamples)
A13 Cel Animation (ANIM)
[Sparta 1988]
35
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
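The header chunk production can be illustrated with a small reader. This is a sketch under stated assumptions rather than a complete validator: the function name parse_midi_header is hypothetical, and the multi-byte integers are read as big-endian, as is usual for Standard MIDI Files. It checks the "MThd" tag, checks that the length of header data is 6, and unpacks the three 16-bit integers.

```python
import struct


def parse_midi_header(data):
    """Parse the header chunk at the start of a Standard MIDI File.

    Returns a dict with the three 16-bit header fields. Raises ValueError
    when the tag or the header length does not match the grammar.
    """
    if len(data) < 14:
        raise ValueError("too short to contain a header chunk")
    chunk_id, length = struct.unpack_from(">4sI", data, 0)
    if chunk_id != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data is not 6")
    fmt, ntrks, division = struct.unpack_from(">HHH", data, 8)
    return {"format": fmt, "ntrks": ntrks, "division": division}
```

The value of ntrks could then serve as an attribute constraining how many track chunks follow, which a context-free production alone does not express.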
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. Apple [1991] also specified a compressed version, AIFC, of the Audio
Interchange File Format. The following recasts the AIFF specification as a context-free
pseudo-EBNF grammar.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[]
MarkerChunk -> "MARK" cksize Markers[]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[]
CommentsChunk -> "COMT" cksize numComments Comments[]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
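The text chunk productions can be exercised with a generic chunk walker. The sketch below is illustrative only: the names iter_chunks and text_chunks are hypothetical, and it assumes the usual EA IFF 85 layout of a 4-byte chunk ID, a 4-byte big-endian size, the data, and one pad byte after odd-sized data. Decoding the text as Latin-1 is also an assumption.

```python
import struct

TEXT_CHUNK_IDS = {b"NAME", b"(c) ", b"AUTH", b"ANNO"}


def iter_chunks(data, start, end):
    """Yield (chunk_id, chunk_data) for each chunk in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        chunk_id, size = struct.unpack_from(">4sI", data, pos)
        if pos + 8 + size > end:
            raise ValueError("chunk extends past the end of its container")
        yield chunk_id, data[pos + 8 : pos + 8 + size]
        # Chunks are aligned to even offsets, so odd sizes carry a pad byte.
        pos += 8 + size + (size & 1)


def text_chunks(form):
    """Return the (id, text) pairs of the text chunks inside one FORM."""
    form_id, form_size, _form_type = struct.unpack_from(">4sI4s", form, 0)
    if form_id != b"FORM":
        raise ValueError("missing FORM tag")
    found = []
    for chunk_id, body in iter_chunks(form, 12, 8 + form_size):
        if chunk_id in TEXT_CHUNK_IDS:
            found.append((chunk_id.decode("ascii"), body.decode("latin-1")))
    return found
```

Because every text chunk is optional, the walker accepts them in any order and in any number, which is looser than the productions shown for ckdata.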
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
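The offsets in the table above translate directly into a fixed-layout header check. The following is a minimal sketch, assuming the simple lossy layout shown in the table and little-endian size fields as is usual for RIFF; the function name parse_webp_header is hypothetical.

```python
import struct


def parse_webp_header(data):
    """Check the fixed 20-byte WebP header and return its size fields."""
    if len(data) < 20:
        raise ValueError("too short to contain a WebP header")
    riff, riff_size, form_type, vp8_tag, vp8_size = struct.unpack_from(
        "<4sI4s4sI", data, 0
    )
    if riff != b"RIFF":
        raise ValueError("missing RIFF tag at offset 0")
    if form_type != b"WEBP":
        raise ValueError("missing WEBP form type at offset 8")
    if vp8_tag != b"VP8 ":
        raise ValueError("missing VP8 tag at offset 12")
    return {
        "riff_size": riff_size,  # size of the data starting at offset 8
        "vp8_size": vp8_size,    # size of the image data starting at offset 20
        "vp8_data": data[20 : 20 + vp8_size],
    }
```

A validator could additionally require that riff_size equals the file length minus 8 and that vp8_size does not exceed riff_size minus 12.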
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last.
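The segment format above can be sketched as a simple walker. This is an illustration under stated assumptions, not a complete JPEG parser: the name iter_jpeg_segments is hypothetical, the walker stops at the start-of-scan segment ($da) because entropy-coded data without a size field follows it, and fill bytes and standalone markers between segments are not handled.

```python
import struct

SOI = b"\xff\xd8"
EOI_TYPE = 0xD9
SOS_TYPE = 0xDA


def iter_jpeg_segments(data):
    """Yield (segment_type, payload) for the segments that follow SOI."""
    if data[:2] != SOI:
        raise ValueError("missing SOI header")
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        seg_type = data[pos + 1]
        if seg_type == EOI_TYPE:
            return
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # The size is big-endian and counts its own two bytes.
        (size,) = struct.unpack_from(">H", data, pos + 2)
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError("bad segment size at offset %d" % pos)
        yield seg_type, data[pos + 4 : pos + 2 + size]
        if seg_type == SOS_TYPE:
            return
        pos += 2 + size
```

For a JFIF file, the first segment yielded would be expected to have type $e0 (APP0).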
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> "IHDR" cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> "PLTE" cksize entries
entries -> entry entries
entries -> entry
entry -> red green blue
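The header chunk production can be illustrated by reading the IHDR fields from the start of a PNG datastream. This is a sketch under stated assumptions; the function name parse_ihdr is hypothetical. Note that in the PNG datastream itself the 4-byte length precedes the chunk type and a 4-byte CRC follows the chunk data, so the byte order read here differs slightly from the order of symbols in the grammar above, which also omits the CRC.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IHDR_FIELDS = (
    "Width", "Height", "BitDepth", "ColourType",
    "Compressionmethod", "Filtermethod", "Interlacemethod",
)


def parse_ihdr(data):
    """Return the IHDR fields of a PNG datastream as a dict."""
    if len(data) < 33:
        raise ValueError("too short to contain the signature and IHDR")
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    length, chunk_type = struct.unpack_from(">I4s", data, 8)
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack_from(">I", data, 29)
    # The CRC covers the chunk type and the chunk data, not the length.
    if zlib.crc32(chunk_type + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    return dict(zip(IHDR_FIELDS, struct.unpack(">IIBBBBB", body)))
```

Attribute rules could then constrain later chunks, for example requiring a PLTE chunk when ColourType is 3.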
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
A.11.1 Animator Cel
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP6
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009a]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
[Microsoft 2005] Microsoft. AVI RIFF File Reference. 2005.
http://msdn.microsoft.com/en-us/library/ms899421.aspx
[Morrison 1985] Jerry Morrison. "EA IFF 85" Standard for Interchange Format Files.
Electronic Arts, January 14, 1985. www.martinreddy.net/gfx/2d/IFF.txt
[Morrison 1986a] Jerry Morrison. ILBM IFF Interleaved Bitmap. Electronic Arts, January 17,
1986. www.fine-view.com/jp/labs/doc/ilbm.txt
[Morrison 1986b] J. Morrison. "SMUS" IFF Simple Musical Score. Electronic Arts, Inc.,
February 5, 1986. http://gopherproxy.org/gopher/meulie.net/00/textfiles/computers/smus
[MultimediaWiki 2006] Smush Animation Format. August 2006.
http://wiki.multimedia.cx/index.php?title=SAN#Chunk_Forma
[MultimediaWiki 2008a] Electronic Arts CMV. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6. April 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV. August 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ. November 2008.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM. January 2008.
http://wiki.multimedia.cx/index.php?title=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI. February 2009.
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM. October 2009.
http://wiki.multimedia.cx/index.php?title=SANM
[Nolan] M. P. Nolan. An LL(1) parser generator implemented for GaTech CS 3240.
http://code.google.com/p/ll-parser-generator/source/browse/trunk/Generator/parser3240/Main.java?r=42
[Novikov and Fillipov] Igor E. Novikov and Valek Fillipov. Cdrloader.py: A cdr import filter
written in Python in the open source graphics file convertor UniConvertor.
http://sk1project.org/modules.php?name=Products&product=cdrexplorer
[OGF 2011] OGF DFDL WG. Data Format Description Language (DFDL) v1.0 Specification.
January 31, 2011. www.ogf.org/documents/GFD.174.pdf
[Oppenheim 1988] Dave Oppenheim. Standard MIDI Files 0.06. March 1, 1988.
www.ibiblio.org/emusic-l/info-docs-FAQs/MIDI-doc/MIDI-SMF.txt
[Parr & Quong 1995] Parr, T. J. & Quong, R. W. (1995). ANTLR: A predicated-LL(k) parser
generator. Software Practice and Experience 25, 7 (July), 789-810.
http://www.antlr.org/article/1055550346383/antlr.pdf
[Parr 2007] Parr, T. (2007). The Definitive ANTLR Reference: Building Domain-Specific
Languages. Pragmatic.
[Peters] Christoph Peters. The Ultimate 3D model file format (v. 2.0).
http://ultimate3d.org/U3DModelFileFormatV2.htm
[Pfeiffer 2003] S. Pfeiffer. The Ogg Encapsulation Format Version 0. RFC 3533, May 2003.
http://xiph.org/ogg/doc/rfc3533.txt
[Plotkin] Andrew Plotkin. Blorb: An IF Resource Collection Format Standard, Version 2.0.
www.eblong.com/zarf/blorb/
http://www.ifarchive.org/if-archive/infocom/interpreters/specification/blorb_format.txt
[Randelshofer 2011] Werner Randelshofer. Class RIFFParser. January 6, 2011.
www.randelshofer.ch/multishow/javadoc/ch/randelshofer/media/riff/RIFFParser.html
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed.). MNG (Multiple-image Network Graphics)
Format Version 1.0. 2001. www.libpng.org/pub/mng/spec/#Credits
[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel File
Format: Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003. OpenOffice.org, April 2008.
http://sc.openoffice.org/excelfileformat.pdf
[REWiki 2009] reWiki. IFF. http://rewiki.regengedanken.de/wiki/IFF
[RFC 3072] M. Wildgrube. Structured Data Exchange Format (SDXF). Request for Comments
3072, March 2001. www.ietf.org/rfc/rfc3072.txt
[Shaw et al 1985] Steve Shaw, Jerry Morrison, and Bob Burns. FTXT: IFF Formatted Text. Draft
2.6, Electronic Arts, November 15, 1985. http://fortunaty.net/com/textfiles/computers/ftxt
[Soras 1995] L. de Soras. Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK. 31/08/95. http://www.wotsit.org/list.asp?search=GTK&button=GO%21
[Sparta 1988] Sparta Inc. ANIM: An IFF Format For CEL Animations. 4 May 1988.
http://amigan.1emu.net/reg/ANIM.txt
[Tamir 2005] Giora Tamir. C# RIFF Parser. June 2005.
www.codeproject.com/KB/files/riffparser.aspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan. Attribute Grammars and their Applications.
In Encyclopedia of Information Science and Technology, Second Edition, Mehdi
Khosrow-Pour (ed.), Information Science Publishing, pp. 268-273, 2008.
www.cs.wright.edu/~tkprasad/papers/Attribute-Grammars.pdf
[Underwood 2009] W. Underwood. Extensions of the UNIX file Command and Magic File for
File Type Identification. PERPOS TR ITTLCSITD 09-02, Georgia Tech Research Institute,
September 2009.
[Ung 1996] David Ung. Retargetable loader. Honours thesis, Computer Science and Electrical
Engineering, The University of Queensland, 1996.
http://itee.uq.edu.au/~cristina/students/david/david.html
[Ward & Issar 1994] W. Ward and S. Issar. Recent improvements in the CMU spoken language
understanding system. ARPA Human Language Technologies Workshop, Plainsboro, New
Jersey. http://acl.ldc.upenn.edu/H/H94/H94-1039.pdf
[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.
Automata, Languages and Programming, Lecture Notes in Computer Science 14, J. Loeckx (ed.),
Springer-Verlag, Berlin, 1974, pp. 9-16.
[Wilhelm and Mauer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995 (chapter 10, Attribute Grammars).
[Xiph 2010] Xiph.Org Foundation. Vorbis I specification. February 3, 2010.
http://xiph.org/vorbis/doc/Vorbis_I_spec.html
[Zappe 1994] H. Zappe. Oktalyzer. 1994.
http://www.wotsit.org/list.asp?search=okt&button=GO%21
33
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications
A1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985]
A11 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 41 of this report
A12 8-bit Sampled Voice (8SVX)
The file format specification for the 9-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985] The following is a re-
expression of that specification as a context-free binary file grammar
Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata
cksize rarr ULONG
ckdata rarr voice8header bodyck
ckdata rarr voice8header nameck copyrightck bodyck
ckdata rarr voice8header copyrightck nameck bodyck
ckdata rarr voice8header nameck bodyck
ckdata rarr voice8header copyrightck bodyck
ckdata rarr voice8header authorck bodyck
ckdata rarr voice8header annotation bodyck
ckdata rarr voice8header atakck releaseck bodyck
34
voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples
samplesPerHiCycle ctOctave sCompression Fixedevolume
oneShotHiSamples rarr ULONG
repeatHiSamples rarr ULONG
samplesPerHiCycle rarr ULONG
samplesPerSec rarr UWORD
ctOctave rarr UBYTE
sCompression rarr UBYTE
Fixed volume rarr UBYTE
nameprop rarr ―NAME cksize CHAR[size]
copyrightck rarr (c) cksize CHAR[size]
authorck rarr ―AUTH cksize CHAR[size]
annotation rarr annotationck annotation
annotation rarr annotationck
annotationck rarr ―ANNO cksize CHAR[size]
atakckrarr ―ATAK cksize EGpoint[size]
releaseck rarr ―RLSE cksize EGpoint[size]
EGpoint rarr duration dest
duration rarr UWORD
dest rarr Fixed
bodyck rarr ―BODY cksize samples[size]
In general the BODY has ctOctave octaves of data The highest frequency octave comes first
comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles The
lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +
repeatHiSamples) The number of samples in the BODY chunk is
(20 + + 2
(ctOctave-1) (oneShotHiSamples + repeatHiSamples)
A13 Cel Animation (ANIM)
[Sparta 1988]
35
A14 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A15 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A16 Planar Bitmap (PBM)
[REWiki 2009]
A2 Standard Midi File Format
Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below
[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit
integers
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
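As an illustration of the header-chunk production, the sketch below checks the "MThd" tag, the header length of 6, and the three 16-bit header fields. It assumes big-endian byte order, as used by Standard MIDI Files; the function name `parse_midi_header` and the dictionary keys are hypothetical.

```python
import struct


def parse_midi_header(data: bytes) -> dict:
    """Validate and decode the 14-byte MThd header chunk of a MIDI file."""
    if len(data) < 14:
        raise ValueError("file too short for a MIDI header chunk")
    # ">" = big-endian: 4-byte tag, 32-bit length, three 16-bit integers.
    tag, length, fmt, ntrks, division = struct.unpack(">4sIHHH", data[:14])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data must be 6")
    return {"format": fmt, "ntrks": ntrks, "division": division}
```

The returned `ntrks` value is the kind of attribute a validator would use to check how many track chunks follow the header.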
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. The following recasts the specification as a context-free pseudo-EBNF
grammar. Apple [1991] also specified a compressed version, AIFC, of the Audio Interchange File
Format.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote detune lowNote highNote lowVelocity
highVelocity gain sustainLoop releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp marker count text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these
chunks is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
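The text-chunk productions above share one shape: a 4-byte identifier, a size, and that many bytes of data. A minimal sketch of a chunk walker for this shape follows. It assumes the EA IFF 85 conventions of a big-endian 32-bit size and a pad byte after odd-sized chunk data; the function name `iter_iff_chunks` is hypothetical.

```python
import struct


def iter_iff_chunks(data: bytes, start: int = 0, end=None):
    """Yield (chunk id, chunk data) pairs from data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        # 4-byte chunk id followed by a big-endian 32-bit size.
        ck_id, ck_size = struct.unpack_from(">4sI", data, pos)
        body_start = pos + 8
        body_end = body_start + ck_size
        if body_end > end:
            raise ValueError("chunk %r overruns its container" % ck_id)
        yield ck_id, data[body_start:body_end]
        # Odd-sized chunk data is followed by one pad byte.
        pos = body_end + (ck_size & 1)
```

To list the local chunks of a FORM, the walker can be applied to the FORM data starting at offset 4, which skips the 4-byte form type such as "AIFF".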
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
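The offset table translates directly into a fixed 20-byte header check. The sketch below assumes little-endian 32-bit sizes, as RIFF uses, and the simple single-"VP8 "-chunk layout shown in the table; the function name `parse_webp_header` and the dictionary keys are hypothetical.

```python
import struct


def parse_webp_header(data: bytes) -> dict:
    """Check the RIFF/WEBP/VP8 tags and return the two size fields."""
    if len(data) < 20:
        raise ValueError("file too short for a WebP header")
    # "<" = little-endian: tag, size, form type, tag, size (20 bytes).
    riff, riff_size, form_type, codec, vp8_size = struct.unpack(
        "<4sI4s4sI", data[:20]
    )
    if riff != b"RIFF" or form_type != b"WEBP":
        raise ValueError("not a RIFF/WEBP container")
    if codec != b"VP8 ":
        raise ValueError("expected a 'VP8 ' chunk at offset 12")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 chunk size exceeds file length")
    return {
        "riff_size": riff_size,          # size of data starting at offset 8
        "vp8_size": vp8_size,            # size of data starting at offset 20
        "vp8_data": data[20:20 + vp8_size],
    }
```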
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two bytes identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of segments (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
  $ff     identifies segment
  n       type of segment (one byte)
  sh, sl  size of the segment, including these two bytes but not
          including the $ff and the type byte. Note: not Intel order;
          high byte first, low byte last
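The segment layout above can be walked with a few lines of code. This is a simplified sketch under stated assumptions: it treats SOI, EOI, TEM, and the restart markers as having no size field, stops at the start-of-scan segment because entropy-coded data follows it, and does not handle optional $ff fill bytes. The names `STANDALONE` and `iter_jpeg_segments` are hypothetical.

```python
import struct

# Markers with no size field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def iter_jpeg_segments(data: bytes):
    """Yield (segment type, payload) pairs up to EOI or start of scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI header")
    yield 0xD8, b""
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        seg_type = data[pos + 1]
        pos += 2
        if seg_type in STANDALONE:
            yield seg_type, b""
            if seg_type == 0xD9:  # EOI trailer
                return
            continue
        if pos + 2 > len(data):
            raise ValueError("truncated segment size")
        # Size is big-endian and counts its own two bytes.
        (size,) = struct.unpack_from(">H", data, pos)
        if size < 2 or pos + size > len(data):
            raise ValueError("bad segment size at offset %d" % pos)
        yield seg_type, data[pos + 2:pos + size]
        pos += size
        if seg_type == 0xDA:  # SOS: compressed image data follows
            return
```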
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000] Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
Compressionmethod Filtermethod Interlacemethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entry -> red green blue
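A sketch of how the header-chunk production could be checked against a file is given below. It follows the chunk layout of the PNG specification, in which the 4-byte big-endian length precedes the chunk type and a CRC over the type and data follows the data. The function name `parse_png_header` and the dictionary keys are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def parse_png_header(data: bytes) -> dict:
    """Validate the PNG signature and IHDR chunk; return the IHDR fields."""
    if len(data) < 33:
        raise ValueError("file too short for signature plus IHDR")
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("bad PNG signature")
    length, ck_type = struct.unpack_from(">I4s", data, 8)
    if ck_type != b"IHDR" or length != 13:
        raise ValueError("first chunk must be a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack_from(">I", data, 29)
    # The CRC covers the chunk type and data, not the length field.
    if zlib.crc32(ck_type + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    fields = struct.unpack(">IIBBBBB", body)
    names = ("width", "height", "bit_depth", "colour_type",
             "compression_method", "filter_method", "interlace_method")
    return dict(zip(names, fields))
```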
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
A.11.1 Animator Cel
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappe 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel File
Format: Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003. OpenOffice.org, April 2008.
httpscopenofficeorgexcelfileformatpdf
[REWiki 2009] reWiki. IFF. httprewikiregengedankendewikiIFF
[RFC 3072] M. Wildgrube. Structured Data Exchange Format (SDXF). Request for Comments
3072, March 2001. wwwietforgrfcrfc3072txt
[Shaw et al. 1985] Steve Shaw, Jerry Morrison, and Bob Burns. FTXT: IFF Formatted Text. Draft
26, Electronic Arts, November 15, 1985. httpfortunatynetcomtextfilescomputersftxt
[Soras 1995] L. de Soras. Official formats of the Graoumf Tracker modules GT2 v1 - v3 and
old GT modules GTK. 310895. httpwwwwotsitorglistaspsearch=GTKampbutton=GO21
[Sparta 1988] Sparta Inc. ANIM: An IFF Format For CEL Animations. 4 May 1988.
httpamigan1emunetregANIMtxt
[Tamir 2005] Giora Tamir. C RIFF Parser. June 2005.
wwwcodeprojectcomKBfilesriffparseraspx
[Thirunarayan 2008] Krishnaprasad Thirunarayan. Attribute Grammars and their Applications.
In Encyclopedia of Information Science and Technology, Second Edition, Editor Mehdi
Khosrow-Pour, Information Science Publishing, pp. 268-273, 2008.
wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf
[Underwood 2009] W. Underwood. Extensions of the UNIX file Command and Magic File for
File Type Identification. PERPOS TR ITTLCSITD 09-02, Georgia Tech Research Institute,
September 2009.
[Ung 1996] David Ung. Retargetable loader. Honours thesis, Computer Science and Electrical
Engineering, The University of Queensland, 1996.
httpiteeuqeduau~cristinastudentsdaviddavidhtml
[Ward & Issar 1994] W. Ward & S. Issar. Recent improvements in the CMU spoken language
understanding system. ARPA Human Language Technologies Workshop, Plainsboro, New
Jersey. httpaclldcupenneduHH94H94-1039pdf
[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.
Automata, Languages and Programming, Lecture Notes in Computer Science 14, J. Loeckx (ed.),
Springer-Verlag, Berlin, 1974, pp. 9-16.
[Wilhelm and Mauer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995 (chapter 10, Attribute Grammars).
[Xiph 2010] Xiph.Org Foundation. Vorbis I specification. February 3, 2010.
httpxiphorgvorbisdocVorbis_I_spechtml
[Zappe 1994] H. Zappe. Oktalyzer. 1994.
httpwwwwotsitorglistaspsearch=oktampbutton=GO21
Appendix A Chunk-based Binary File Formats
This appendix references specifications for about 75 chunk-based binary file formats and
discusses some of those specifications.
A.1 Electronic Arts Interchange File Format (IFF)
[Amigan 2011]
Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the
specification for the Interchange File Format [Morrison 1985].
A.1.1 Interleaved Bitmap (ILBM)
[Morrison 1986]
This file format was discussed in section 4.1 of this report.
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 8-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a
re-expression of that specification as a context-free binary file grammar.
Form_8SVX_chunk → "FORM" cksize "8SVX" ckdata
cksize → ULONG
ckdata → voice8header bodyck
ckdata → voice8header nameck copyrightck bodyck
ckdata → voice8header copyrightck nameck bodyck
ckdata → voice8header nameck bodyck
ckdata → voice8header copyrightck bodyck
ckdata → voice8header authorck bodyck
ckdata → voice8header annotation bodyck
ckdata → voice8header atakck releaseck bodyck
voice8header → "VHDR" cksize oneShotHiSamples repeatHiSamples
    samplesPerHiCycle samplesPerSec ctOctave sCompression volume
oneShotHiSamples → ULONG
repeatHiSamples → ULONG
samplesPerHiCycle → ULONG
samplesPerSec → UWORD
ctOctave → UBYTE
sCompression → UBYTE
volume → Fixed
nameck → "NAME" cksize CHAR[size]
copyrightck → "(c) " cksize CHAR[size]
authorck → "AUTH" cksize CHAR[size]
annotation → annotationck annotation
annotation → annotationck
annotationck → "ANNO" cksize CHAR[size]
atakck → "ATAK" cksize EGpoint[size]
releaseck → "RLSE" cksize EGpoint[size]
EGpoint → duration dest
duration → UWORD
dest → Fixed
bodyck → "BODY" cksize samples[size]
In general, the BODY has ctOctave octaves of data. The highest frequency octave comes first,
comprising the fewest samples: oneShotHiSamples + repeatHiSamples. Each successive octave
contains twice as many samples as the next higher octave but the same number of cycles. The
lowest frequency octave comes last with the most samples:
$2^{\mathit{ctOctave}-1} \cdot (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$.
The number of samples in the BODY chunk is
$$(2^{0} + \dots + 2^{\mathit{ctOctave}-1}) \cdot (\mathit{oneShotHiSamples} + \mathit{repeatHiSamples})$$
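The sample count above is the kind of constraint a validator could check against the size of the BODY chunk. The following is a minimal sketch of that arithmetic in Python. The function names are hypothetical and are not part of the 8SVX specification, and the sketch assumes one byte per sample with no compression.

```python
def octave_sample_counts(ct_octave, one_shot_hi_samples, repeat_hi_samples):
    """Return the number of samples in each octave, highest frequency first.

    Each successive octave holds twice as many samples as the one before it.
    """
    base = one_shot_hi_samples + repeat_hi_samples
    return [base * (2 ** k) for k in range(ct_octave)]


def body_sample_count(ct_octave, one_shot_hi_samples, repeat_hi_samples):
    """Return the total number of samples expected in the BODY chunk."""
    return sum(octave_sample_counts(ct_octave, one_shot_hi_samples,
                                    repeat_hi_samples))


if __name__ == "__main__":
    # Three octaves with 10 one-shot and 6 repeat samples in the highest octave.
    print(octave_sample_counts(3, 10, 6))
    print(body_sample_count(3, 10, 6))
```

Because the octave sizes form a geometric series, the total equals (2^ctOctave - 1) times the size of the highest octave, which gives a closed form for the check.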
A.1.3 Cel Animation (ANIM)
[Sparta 1988]
A.1.4 IFF Formatted Text (FTXT)
[Shaw et al 1985]
A.1.5 IFF Simple Musical Score (SMUS)
[Morrison 1986b]
A.1.6 Planar Bitmap (PBM)
[REWiki 2009]
A.2 Standard MIDI File Format
Oppenheim [1988] described the MIDI file format using the pseudo-EBNF notation shown below
[Int MIDI Assoc 2008]. The "length of header data" is 6. Header data consist of three 16-bit
integers.
Midi → <headerchunk> <trackchunks>
<headerchunk> → "MThd" <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunks> → <trackchunk>
<trackchunk> → "MTrk" <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
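As an illustration of the header chunk production, the sketch below reads an MThd chunk with Python's standard struct module and checks that the length of the header data is 6. The function name and the returned dictionary keys are hypothetical. The sketch assumes big-endian integers, as used in Standard MIDI Files, and a 32-bit length field.

```python
import struct


def parse_midi_header(data):
    """Parse the 14-byte MThd chunk at the start of a Standard MIDI File."""
    if len(data) < 14:
        raise ValueError("too short to contain an MThd chunk")
    # 4-byte tag, 32-bit length, then three 16-bit integers, all big-endian.
    tag, length, fmt, ntrks, division = struct.unpack(">4sIHHH", data[:14])
    if tag != b"MThd":
        raise ValueError("missing MThd tag")
    if length != 6:
        raise ValueError("length of header data must be 6")
    return {"format": fmt, "ntrks": ntrks, "division": division}


if __name__ == "__main__":
    sample = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
    print(parse_midi_header(sample))
```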
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF), developed by Apple Computer in 1988, was based on
Electronic Arts' Interchange File Format. The file format is specified using C data structure
definitions [Apple 1989]. Apple [1991] also specified a compressed version, AIFC, of the Audio
Interchange File Format. The following recasts the AIFF specification as a context-free
pseudo-EBNF grammar.
AudioAIFFfile -> "FORM" cksize "AIFF" localchunks
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
    sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Markers -> Marker
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
    detune
    lowNote
    highNote
    lowVelocity
    highVelocity
    gain
    sustainLoop
    releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments comments[ ]
Comments -> Comment Comments
Comments -> Comment
Comment -> timeStamp
    marker
    count
    text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks; their data portion consists solely of text. Each of these chunks is
optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
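The productions above share one shape: a four-character chunk ID, a size, and that many bytes of data. A syntax checker can exploit this by walking the chunks generically before checking each one. The sketch below lists the local chunks of an AIFF file. The function names are hypothetical. The sketch assumes the EA IFF 85 conventions of a big-endian 32-bit size and a pad byte after odd-sized chunk data, which the grammar does not express.

```python
import struct


def iter_iff_chunks(data, offset, end):
    """Yield (chunk_id, chunk_data) pairs found in data[offset:end]."""
    while offset + 8 <= end:
        ck_id, ck_size = struct.unpack_from(">4sI", data, offset)
        start = offset + 8
        if start + ck_size > end:
            raise ValueError("chunk %r runs past its container" % ck_id)
        yield ck_id, data[start:start + ck_size]
        # Chunks are aligned to even offsets: skip one pad byte if size is odd.
        offset = start + ck_size + (ck_size & 1)


def list_aiff_chunks(data):
    """Return the IDs of the local chunks in a FORM AIFF file, in file order."""
    if len(data) < 12:
        raise ValueError("too short to be an AIFF file")
    form_id, form_size, form_type = struct.unpack_from(">4sI4s", data, 0)
    if form_id != b"FORM" or form_type != b"AIFF":
        raise ValueError("not a FORM AIFF file")
    end = 8 + form_size
    if end > len(data):
        raise ValueError("FORM size exceeds file length")
    # Local chunks start after the 12-byte FORM header.
    return [ck_id for ck_id, _ in iter_iff_chunks(data, 12, end)]
```

A checker built this way can report unknown or misplaced chunk IDs before it validates the fields inside each chunk.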
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP" the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
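The offset table above maps directly onto a fixed 20-byte header, so a first validation pass can be written with a single unpack call. The sketch below is illustrative only. The function name and returned keys are hypothetical. The little-endian byte order of the two size fields is an assumption taken from the RIFF convention, not from the table itself.

```python
import struct


def parse_webp_header(data):
    """Check the RIFF/WEBP/VP8 tags and return the two size fields."""
    if len(data) < 20:
        raise ValueError("too short to contain a WebP header")
    # Offsets 0, 4, 8, 12, 16 from the table: tag, size, tag, tag, size.
    riff, riff_size, form, fourcc, vp8_size = struct.unpack_from(
        "<4sI4s4sI", data, 0)
    if riff != b"RIFF" or form != b"WEBP":
        raise ValueError("not a RIFF WEBP container")
    if fourcc != b"VP8 ":
        raise ValueError("expected a 'VP8 ' chunk at offset 12")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 chunk size exceeds file length")
    return {
        "riff_size": riff_size,
        "vp8_size": vp8_size,
        "vp8_data": data[20:20 + vp8_size],
    }
```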
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker (see below)
- any number of "segments" (similar to IFF chunks) (see below)
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last
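The segment layout described above can be sketched as a simple walk over the file. The function name below is hypothetical. The sketch treats TEM and RST0 to RST7 as markers without a size field and stops after the SOS segment header, because entropy-coded image data follows it and does not use the segment layout. Both are simplifying assumptions and not part of the description above.

```python
import struct

# Markers that carry no size field (TEM and RST0..RST7); an assumption here.
STANDALONE_MARKERS = {0x01, *range(0xD0, 0xD8)}


def iter_jpeg_segments(data):
    """Yield (segment_type, payload) pairs after the SOI marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected $ff at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte before a marker
            pos += 1
            continue
        pos += 2
        if marker == 0xD9:          # EOI: trailer reached
            yield marker, b""
            return
        if marker in STANDALONE_MARKERS:
            yield marker, b""
            continue
        if pos + 2 > len(data):
            raise ValueError("truncated segment size")
        # Size is high byte first and counts its own two bytes.
        (size,) = struct.unpack_from(">H", data, pos)
        if size < 2 or pos + size > len(data):
            raise ValueError("bad segment size")
        yield marker, data[pos + 2:pos + size]
        pos += size
        if marker == 0xDA:          # SOS: compressed data follows
            return
```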
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
A.5.4 JPEG 2000 Codestream
[ISO 2000], Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNGfile -> PNGSignature headerck paletteck
headerck -> cksize "IHDR" Width Height BitDepth ColourType
    CompressionMethod FilterMethod InterlaceMethod crc
paletteck -> cksize "PLTE" entries crc
entries -> entry entries
entries -> entry
entry -> red green blue
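To make the header chunk production concrete, the sketch below reads the PNG signature and the IHDR chunk with Python's standard struct and zlib modules. The function name and returned keys are hypothetical. The sketch relies on the PNG chunk layout of a big-endian 4-byte length, a 4-byte chunk type, the chunk data, and a 4-byte CRC computed over the type and data.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def parse_png_ihdr(data):
    """Validate the PNG signature and IHDR chunk; return the header fields."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    if len(data) < 8 + 8 + 13 + 4:
        raise ValueError("too short to contain an IHDR chunk")
    length, ck_type = struct.unpack_from(">I4s", data, 8)
    if ck_type != b"IHDR" or length != 13:
        raise ValueError("first chunk must be a 13-byte IHDR")
    body = data[16:29]
    (crc,) = struct.unpack_from(">I", data, 29)
    # The CRC covers the chunk type and data, not the length field.
    if zlib.crc32(ck_type + body) != crc:
        raise ValueError("IHDR CRC mismatch")
    names = ("width", "height", "bit_depth", "colour_type",
             "compression_method", "filter_method", "interlace_method")
    return dict(zip(names, struct.unpack(">IIBBBBB", body)))
```

The CRC check is an example of a constraint that a context-free grammar cannot state on its own and that would be carried by an attribute rule.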
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File (Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].
localchunks -> CommonChunk SoundDataChunk
CommonChunk -> "COMM" cksize numChannels numSampleFrames sampleSize
               sampleRate
SoundDataChunk -> "SSND" cksize offset blockSize soundData[ ]
MarkerChunk -> "MARK" cksize Markers[ ]
Markers -> Marker Markers
Marker -> markerid position markerName
InstrumentChunk -> "INST" cksize baseNote
                   detune
                   lowNote
                   highNote
                   lowVelocity
                   highVelocity
                   gain
                   sustainLoop
                   releaseLoop
MIDIDataChunk -> "MIDI" cksize MIDIData[ ]
AudioRecordingChunk -> "AESD" cksize AESChannelStatusData[24]
ApplicationSpecificChunk -> "APPL" cksize applicationSignature data[ ]
CommentsChunk -> "COMT" cksize numcomments Comments[ ]
Comments -> Comment Comments
Comment -> timeStamp
           marker
           count
           text
The name, author, copyright, and annotation chunks are included in the definition of every EA
IFF 85 file. All are text chunks: their data portion consists solely of text. Each of these chunks
is optional.
nameprop -> "NAME" size CHAR[size]
copyrightck -> "(c) " size CHAR[size]
authorck -> "AUTH" size CHAR[size]
annotation -> annotationck annotation
annotation -> annotationck
annotationck -> "ANNO" size CHAR[size]
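The productions above all share one shape: a four-character ID, a size field, and then that many bytes of data. A minimal Python sketch of walking such chunks and collecting the optional text chunks follows. It assumes the EA IFF 85 conventions of a big-endian 32-bit size and one pad byte after odd-length data. The names iter_iff_chunks and text_chunks are illustrative and do not come from this report, and decoding the text as Latin-1 is likewise an assumption.

```python
import struct

# IDs of the optional text chunks named in the productions above.
TEXT_CHUNK_IDS = {b"NAME", b"AUTH", b"(c) ", b"ANNO"}


def iter_iff_chunks(data, start=0, end=None):
    """Yield (chunk_id, payload) for each chunk found in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        # Assumed layout: 4-byte ID followed by a big-endian 32-bit size.
        chunk_id, size = struct.unpack_from(">4sI", data, pos)
        body_start = pos + 8
        body_end = body_start + size
        if body_end > end:
            raise ValueError(f"chunk {chunk_id!r} overruns its container")
        yield chunk_id, data[body_start:body_end]
        # Odd-length data is followed by one pad byte not counted in size.
        pos = body_end + (size & 1)


def text_chunks(data):
    """Return [(id, text)] for the text chunks, in file order."""
    return [
        (chunk_id.decode("ascii"), body.decode("latin-1"))
        for chunk_id, body in iter_iff_chunks(data)
        if chunk_id in TEXT_CHUNK_IDS
    ]
```

Because each chunk announces its own size, a validator built this way can skip chunk types it does not recognize and still check that every size field stays inside the enclosing container.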
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF), introduced by IBM and Microsoft [1991], is also
based on Electronic Arts' Interchange File Format. Microsoft container formats such as
Audio-Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. The WebP picture format, introduced by Google in 2010, also uses RIFF as a
container.
A.4.1 RIFF-Waveform (WAVE)
[IBM & Microsoft 1991]
This file format was discussed in Section 4.2 of this report.
A.4.2 WAVEFORMATEX
[Microsoft 1994]
A.4.3 WAVEFORMATEXTENSIBLE
[Microsoft 2003b]
A.4.4 Broadcast Wave Format
[EBU 2001]
A.4.5 Windows Audio-Video Interleave (AVI)
[Microsoft 2005]
A.4.6 Animated Windows Cursor (ANI)
[Houghtaling 1996]
A.4.7 RIFF MIDIfile (RMID)
[IBM & Microsoft 1991]
A.4.8 Device-Independent Bitmap (RIFF RDIB)
The RDIB format consists of a Windows 3.0 DIB enclosed in a RIFF chunk [IBM & Microsoft
1991].
A.4.9 WebP
WebP is a method of lossy compression for digital images. The degree of compression is
adjustable, so a user can choose the trade-off between file size and image quality. WebP typically
achieves an average of 39% more compression than JPEG and JPEG 2000 without loss of image
quality. A WebP file consists of VP8 image data and a container based on RIFF. The format of a
WebP file is shown below [Google 2010].
Offset  Description
0       "RIFF" 4-byte tag
4       size of the data chunk starting at offset 8
8       "WEBP", the form-type signature
12      "VP8 " 4-byte tag describing the raw video format used
16      size of the raw VP8 image data chunk starting at offset 20
20      the VP8 image data
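The offset table can be checked mechanically. The sketch below is a minimal Python illustration, not a complete WebP validator: it reads the 20-byte header at the offsets listed and returns the VP8 payload. It assumes the RIFF convention of little-endian 32-bit sizes and a "VP8 " tag padded with a trailing space. The function name parse_webp_header is hypothetical.

```python
import struct


def parse_webp_header(data):
    """Check the fixed 20-byte header and return the sizes and VP8 payload."""
    if len(data) < 20:
        raise ValueError("too short for the 20-byte WebP header")
    # Offsets 0, 4, 8, 12, 16 from the table; "<" selects little-endian (RIFF).
    riff_tag, riff_size, form_type, vp8_tag, vp8_size = struct.unpack_from(
        "<4sI4s4sI", data, 0
    )
    if riff_tag != b"RIFF":
        raise ValueError("missing RIFF tag at offset 0")
    if form_type != b"WEBP":
        raise ValueError("missing WEBP form type at offset 8")
    if vp8_tag != b"VP8 ":
        raise ValueError("missing VP8 tag at offset 12")
    if 20 + vp8_size > len(data):
        raise ValueError("VP8 chunk size runs past the end of the file")
    return {
        "riff_size": riff_size,          # bytes counted from offset 8
        "vp8_size": vp8_size,            # bytes counted from offset 20
        "vp8_data": data[20:20 + vp8_size],
    }
```

A fuller check would also compare riff_size against the actual file length; that step is left out here to keep the sketch aligned with the table.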
A.5 JPEG
A.5.1 JPEG/JFIF
[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]
JPEG/JFIF file format:
- header (2 bytes): $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
- for JFIF files, an APP0 segment immediately follows the SOI marker, see below
- any number of segments (similar to IFF chunks), see below
- trailer (2 bytes): $ff, $d9 (EOI)
Segment format:
- header (4 bytes):
    $ff     identifies segment
    n       type of segment (one byte)
    sh, sl  size of the segment, including these two bytes, but not
            including the $ff and the type byte. Note: not Intel order;
            high byte first, low byte last.
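The segment layout above translates into a short scanning loop. The following Python sketch is illustrative only: it lists the segments between the SOI header and either the start-of-scan segment or the EOI trailer. It assumes every segment in that range carries the sh, sl size field, which does not hold for standalone markers, and it stops at start of scan because compressed image data follows that segment. The function name list_jpeg_segments is hypothetical.

```python
import struct

EOI, SOS = 0xD9, 0xDA  # end-of-image and start-of-scan type bytes


def list_jpeg_segments(data):
    """Return [(type_byte, payload)] for the segments that follow SOI."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI header ($ff $d8)")
    segments = []
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected $ff at offset {pos}")
        seg_type = data[pos + 1]
        if seg_type == EOI:
            break
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        # sh, sl: high byte first; counts itself but not $ff and the type byte.
        (size,) = struct.unpack_from(">H", data, pos + 2)
        if size < 2 or pos + 2 + size > len(data):
            raise ValueError(f"bad size in segment at offset {pos}")
        segments.append((seg_type, data[pos + 4:pos + 2 + size]))
        if seg_type == SOS:
            break  # compressed image data follows; stop scanning here
        pos += 2 + size
    return segments
```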
A.5.2 JPEG/Exif
[JEIDA 2002]
A.5.3 JPEG 2000
"Every marker is two bytes long. The first byte consists of a single 0xFF byte. The second byte
denotes the specific marker and can have any value in the range 0x01 to 0xFE. Many of these
markers are already used in ITU-T Rec. T.81 | ISO/IEC 10918-1 and ITU-T Rec. T.84 | ISO/IEC
10918-3 and shall be regarded as reserved unless specifically used. A marker segment includes a
marker and associated parameters, called marker parameters. In every marker segment the first
two bytes after the marker shall be an unsigned big endian integer value that denotes the length
in bytes of the marker parameters (including two bytes of this length parameter but not the two
bytes of the marker itself)." [ISO 2000, page 13]
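The length rule in the quotation can be made concrete with a small Python sketch. It reads one marker segment, checks the marker byte range given in the quotation, and uses the length to locate the next marker. The function name read_marker_segment is hypothetical, and the sketch assumes the caller only applies it to markers that actually carry a length field.

```python
import struct


def read_marker_segment(data, pos=0):
    """Return (marker, params, next_pos) for the marker segment at data[pos:]."""
    if pos + 4 > len(data):
        raise ValueError("truncated marker segment")
    # Two-byte marker, then a two-byte unsigned big-endian length.
    marker, length = struct.unpack_from(">HH", data, pos)
    # First byte must be 0xFF; second byte must lie in 0x01..0xFE.
    if marker >> 8 != 0xFF or not 0x01 <= (marker & 0xFF) <= 0xFE:
        raise ValueError(f"not a marker: 0x{marker:04X}")
    # The length counts its own two bytes but not the two marker bytes.
    if length < 2 or pos + 2 + length > len(data):
        raise ValueError("bad marker segment length")
    params = data[pos + 4:pos + 2 + length]
    return marker, params, pos + 2 + length
```

The returned next_pos is the offset of the following marker, so repeated calls step through a run of consecutive marker segments.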
A.5.4 JPEG 2000 Codestream
[ISO 2000], Annex A, Codestream syntax
A.6 Advanced Systems Format
Microsoft's Advanced Systems Format (ASF) [2001] is also a chunk-based container format.
The most common file formats contained within an ASF file are Windows Media Audio (WMA)
and Windows Media Video (WMV).
A.6.1 Windows Media Audio 7-9
[Microsoft 2004]
A.6.2 Windows Media Video 7-91
[Microsoft 2004]
A.7 Portable Network Graphics
[ISO 15948:2003]
PNG file -> PNGSignature headerck paletteck
headerck -> IHDR cksize Width Height BitDepth ColourType
            Compressionmethod Filtermethod Interlacemethod
paletteck -> PLTE cksize entries
entries -> entry entries
entry -> red green blue
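The PNG productions above can be exercised with a short Python sketch that reads the signature, the IHDR fields, and the PLTE entries. One caveat is worth stating: in the PNG specification each chunk is stored as a 4-byte big-endian length, then the 4-byte type, then the data, then a CRC-32 over type and data, so the length precedes the type on disk. The function names below are illustrative and are not taken from this report.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

IHDR_FIELDS = (
    "width", "height", "bit_depth", "colour_type",
    "compression_method", "filter_method", "interlace_method",
)


def read_png_chunks(data):
    """Yield (chunk_type, payload) for each chunk after the signature."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    pos = 8
    while pos + 12 <= len(data):
        length, chunk_type = struct.unpack_from(">I4s", data, pos)
        body_end = pos + 8 + length
        if body_end + 4 > len(data):
            raise ValueError(f"chunk {chunk_type!r} is truncated")
        body = data[pos + 8:body_end]
        (crc,) = struct.unpack_from(">I", data, body_end)
        # The CRC covers the chunk type and data, not the length field.
        if zlib.crc32(chunk_type + body) != crc:
            raise ValueError(f"CRC mismatch in chunk {chunk_type!r}")
        yield chunk_type, body
        pos = body_end + 4


def parse_ihdr(body):
    """Map the 13-byte IHDR payload to the fields of the headerck rule."""
    return dict(zip(IHDR_FIELDS, struct.unpack(">IIBBBBB", body)))


def parse_plte(body):
    """Split the PLTE payload into (red, green, blue) entries."""
    if len(body) % 3:
        raise ValueError("PLTE length is not a multiple of 3")
    return [tuple(body[i:i + 3]) for i in range(0, len(body), 3)]
```

The number of palette entries is not stored separately; it follows from the chunk length divided by three, which is the kind of dependency an attribute on cksize can express.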
A.7.1 Multiple-image Network Graphics
[Randers-Pehrson 2001]
A.7.2 JPEG Network Graphics
[Randers-Pehrson 2001]
A.8 Binary Interchange File Format
[Rentz 2008]
A.9 3D Studio
A.9.1 3D Studio File Format – 3ds
[Autodesk 1997]
A.9.2 3D Studio Material Library Files
[Fercoq 1996]
A.10 Anim8or
[Glanville 2003]
A.11 Animator
[Crazy Talk]
A.11.1 Animator Cel
[Crazy Talk]
A.11.2 Animator Pro Fli
[Haaland]
A.11.3 Animator Pro fLC
[Crazy Talk]
A.11.4 Animator Pro PIC
[Maischein]
A.12 Audio Manager Module
[Foo 1995]
A.13 Autodesk Drawing Interchange Files (DXF) (Binary)
[Autodesk 2009]
A.14 Caligari TrueSpace SceneObject (Binary)
[Caligari 1998]
A.15 CorelDRAW Vector Graphics File
(Versions 3.0 – X3)
[Novikov and Fillipov]
A.16 Corel Metafile Exchange Image
[Corel 1998]
A.17 Music Modules
A.17.1 DigiTrekker Module Format
[Beham 1995]
A.17.2 Graoumf Tracker modules
[Soras 1995]
A.17.3 Oktalyzer
[Zappa 1994]
A.18 EA Multimedia Game File Formats
[Anisimovsky 2002a]
A.18.1 Electronic Arts Music (asf)
[Anisimovsky 2002b]
A.18.2 Electronic Arts CMV
[MultimediaWiki 2008a]
A.18.3 Electronic Arts DCT
[MultimediaWiki 2008b]
A.18.4 Electronic Arts MAD
[MultimediaWiki 2008c]
A.18.5 Electronic Arts MPC
[MultimediaWiki 2008d]
A.18.6 Electronic Arts VP7
[MultimediaWiki 2008e]
A.18.7 Electronic Arts TGV
[MultimediaWiki 2008f]
A.18.8 Electronic Arts TGQ
[MultimediaWiki 2008g]
A.18.9 Electronic Arts TQI
[MultimediaWiki 2009]
A.19 Imagine 3D Data Description (TDDD)
[Kirvan 1994]
A.20 Lightwave 3D Objects
A.20.1 LWOB
[LightWave 1996]
A.20.2 LWO2
[Lightwave 2001a]
A.20.3 LWLO
[Ferguson 1995]
A.21 Luxology Modo 3D Object (LXOB)
[Luxology 2006]
A.22 Paint Shop Pro (PSP)
[Jasc 1998]
A.23 ProVector 2D Drawing
[Cunniff and Orr]
A.24 ProWrite Document
[Bayless 1987]
A.25 Interactive Fiction
A.25.1 Blorb Z-machine Resource File
[Plotkin]
A.25.2 Quetzal
[Frost 1997a, 1997b]
A.26 Apple QuickTime
[Apple 2007]
A.27 Maya IFF Bitmap
[Alias-wavefront 1999]
A.28 Apple Core Audio Format
[Apple 2005]
A.29 American Laser Games (MM)
[MultimediaWiki 2008h]
A.30 Smush Animation Format
A.30.1 SAN
[MultimediaWiki 2006]
A.30.2 SNM
[MultimediaWiki 2009b]
A.31 Structured Data eXchange Format (SDXF)
[RFC 3072]
A.32 Ogg Vorbis
[Pfeiffer 2003] [Xiph 2010]
A.33 Simple Vector Array Format
[Martin 2007]
A.34 Nested List File
Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by
Onda as the metaformat for its compressed files [Blank Aspect 2010a, 2010b].
A.35 Cinema 4D
[Losch 1997]
A.36 The Sims Interchange File Format
Jamie Doornbos is the lead programmer of Maxis' The Sims [Baum & Noel 2002].