reversing data formats: what data can reveal

Reversing Data Formats:What Data Can Reveal

by Anton Dorfman ZeroNights 0x03, Moscow, 2013

About me

• Fan of & Fun with Assembly language• Researcher• Scientist• Teach Reverse Engineering since 2001• Candidate of technical science

Observed topics

• Samples• Basics• Thoughts• Fields• Related works• Applications

Samples

What is it?

What is it? Hint 1

What is it? Hint 2

What is it?

What is it? Hint 1

What is it? Hint 2

Basics

Standard way

• Hex editor• Researcher• Brain (equally important as the Hex editor)• Basic knowledge how data can be organized

(in brain)• Analysis of the executable file that

manipulates with data format

3 ways of researching

• Dynamic analysis – use reverse engineering of application that manipulate with data format

• Static analysis – try to extract header, structures, fields and try to find relationship between data

• Statistic analysis – when have some amount of samples and use statistics of changes and ranges of values

Breaking the rules

• There is the rule RTFM (Read The F**king Manual)

• Nobody likes it• I’m not exception

• First of all I start my research, and second – try to find related works and generalize ideas from them

Definitions

• Field – some value used to describe Data Format

• Structure – way for organizing various fields• Data – information for representing which

Data Format is developed• Header – common structure before data, may

contain substructures

Thoughts

Main Idea

• Data format is developed by human (not pets or aliens)

• Sometimes looks like that no human works on it…• Format developers use common data organization

concepts and similar thoughts when creating new data formats

• If we find regularities in data format organization rules we can automate searching of them

Three types of data format

• File formats – multimedia formats, database formats, internal formats for exchanging between program components and etc.

• Protocols – network protocols, hardware device interaction protocols, protocols of interaction between driver and user space application and etc.

• Structures in memory – OS structures, application structures and etc.

Levels of abstraction

• Bit Order• Byte Order• Fields Size• Field Basic Type• Field Type• Structure• Field Semantics

Tasks to solve

• Extract header, separate it from data• Find field boundaries• Find structures and substructures• Find types of fields• Detect bit and byte ordering• Determine semantics of fields• Interpret goal of field – it’s semantics

Bit order

• There are most significant bit – MSB and least significant bit - LSB

• MSB is a left-most bit – commonly used – value is 10010101b – 95h - 149

• LSB is a left-most bit - value is 10101001b – A9h - 169

Byte order

• Little-endian – Intel byte order• Big-endian – network byte order

• Middle-endian or mixed-endian – mixed byte order – when length of value more then default processor machine word

Fields

Types of fields

• Service fields – for describing Data Format (size of structure and etc.)

• Common fields – “fields from life” (time, date and etc.)

• Specific fields – we can find range for that type (bit flags and etc.)

• Application specific – can be interpret only by application, we should use dynamic analysis

Field Size and Levels of field interpretation

• Commonly field is a byte sequence• Sometimes field is a bit sequence

• Fixed size in bytes (1, 2, 3, 4, …)• Fixed size in bits (1, 2, 3, 4, …)• Variable size in bytes• Variable size in bits

Basic field types

• Hex value – by default• Decimal value• Character value (up to 4 symbols)• String (ASCII or Unicode)• Float value• Bits value

Types of field

• Text• Offset• Size• Quantity• ID• Flag• Counter• DateTime• Protect

Text

• Usually fixed size, remaining space after text filled with 0

• Variable sized text ended with 0, or there are size of this text field

• Usually text string contain their meanings

Offset

• Offset to another field• Offset to data• Pointer to another structure (may be child or

parent)• Pointer to another instance of such structure

(next or previous in linked list)

• Can be relative and absolute

Export in PE-EXE

IMAGE_DOS_HEADER

+03Che_lfanew DWORD

...

...

Module ImageBase

+078h

IMAGE_NT_HEADERS

OptionalHeaderIMAGE_OPTIONAL_HEADER

Signature DWORD

IMAGE_DIRECTORY_ENTRY_EXPORT

FileHeader IMAGE_FILE_HEADER

...

+024h

IMAGE_EXPORT_DIRECTORY

AddressOfNames

...

AddressOfNameOrdinals

AddressOfFunctions

...

+020h

+01Ch

Names (Offsets)

Offset FuncName X+1

Offset FuncName X

Offset FuncName X+2

Name Ordinals

Index FuncAddr X+1

Index FuncAddr X

...

Functions Addresses

FuncAddr X+1

...

...

Function Names

...

‘LoadLibraryA’,0

...

...

Index FuncAddr X+2

...

‘LoadLibraryExA’,0

FuncAddr X

FuncAddr X+2

+ImageBase +ImageBase

...

+ImageBase

+ImageBase

+ImageBase

‘LoadLibraryExW’,0LoadLibraryExA

+ImageBase

LoadLibraryExA EP:

Index Index=(Elem size - dword)

(Elem size - word)

(Elem size - dword)

mov edi,edipush ebp...

+ImageBase

Find

Size

• Size of data• Size of structure• Size of substructure• Size of field

• Can be not only in bytes, but also in bits, words (2 byte), double words, paragraphs (16 byte) and etc.

Quantity

• Quantity of substructures• Quantity of elements

• IMAGE_NT_HEADERS.IMAGE_FILE_HEADER. NumberOfSections

• IMAGE_EXPORT_DIRECTORY. NumberOfFunctions

ID

• Identifier of data format, often called magic number

• ID of structure• ID of substructure

• Often ID is value consist of char symbols• Often ID looks like “magic” - BE BA FE CA

Flag

• Bit Flags• Enum values as a flags• Special case – Bool value

Counter

• Increment (possibly decrement)• Starts with 0 / begins with another value• Changes by 1 or another value

DateTime

• Storing format• Resolution• Moment of beginning• Range

Protect

• CRC value of data• CRC value of substructure• Various Hash functions and etc.

Related works

Projects, articles and authors• Protocol Informatics Project – based on “Network Protocol Analysis

using Bioinformatics Algorithms” by Marshall A. Beddoe• Discoverer - “Discoverer: Automatic Protocol Reverse Engineering

from Network Traces” by Weidong Cui, Jayanthkumar Kannan, Helen J. Wang

• Laika – “Digging For Data Structures” by Anthony Cozzie, Frank Stratton, Hui Xue, and Samuel T. King

• RolePlayer – “Protocol-Independent Adaptive Replay of Application Dialog” by Weidong Cuiy, Vern Paxsonz, Nicholas C. Weaverz, Randy H. Katzy

• Tupni – “Tupni: Automatic Reverse Engineering of Input Formats” by Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, Luiz Irun-Briz

Protocol Informatics Project• Global and local sequence alignment -

Needleman Wunsch and Smith Waterman algorithms

Discoverer

Laika

• Using Bayesian unsupervised learning• For fixed size structures only

Tupni

• Based on taint tracking engine

Applications

• RE any program• RE undocumented/proprietary file formats• RE undocumented/proprietary network protocols• RE undocumented structures in memory• Fuzzing • Examination of protocol implementation• Replay network interaction• Zero-day vulnerability signature generation

Thank you!

• Questions?

• Meet me at [email protected],• [email protected]

mailto:[email protected]

mailto:[email protected]

reversing data formats: what data can reveal

Documents