Reversing Data Formats: What Data Can Reveal
by Anton Dorfman, ZeroNights 0x03, Moscow, 2013
About me
• Fan of & Fun with Assembly language
• Researcher
• Scientist
• Teach Reverse Engineering since 2001
• Candidate of technical science
Observed topics
• Samples
• Basics
• Thoughts
• Fields
• Related works
• Applications
Samples
[Image slides not preserved in the transcript: "What is it?" sample puzzles, each followed by "Hint 1" and "Hint 2"]
Basics
Standard way
• Hex editor
• Researcher
• Brain (as important as the hex editor)
• Basic knowledge of how data can be organized (in the brain)
• Analysis of the executable file that manipulates the data format
3 ways of researching
• Dynamic analysis – reverse engineer the application that manipulates the data format
• Static analysis – try to extract the header, structures and fields, and try to find relationships between the data
• Statistical analysis – when several samples are available, use statistics on how values change and which ranges they take
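The statistical approach can be sketched in a few lines. This is only an illustration, not the author's tooling: the function name `column_stats` and the three sample records are hypothetical. For every byte offset it reports the range and the number of distinct values seen across the samples, which is often enough to separate constant IDs from changing fields.

```python
def column_stats(samples):
    """For each byte offset shared by all samples, return (offset, min, max, distinct_count)."""
    length = min(len(s) for s in samples)
    stats = []
    for off in range(length):
        values = {s[off] for s in samples}
        stats.append((off, min(values), max(values), len(values)))
    return stats

# Hypothetical samples: 4-byte constant ID, then two bytes that change
samples = [
    b"FMT1\x00\x41",
    b"FMT1\x01\x7a",
    b"FMT1\x02\x13",
]

for off, lo, hi, distinct in column_stats(samples):
    kind = "constant" if distinct == 1 else "variable"
    print(f"+{off:02X}h  {kind:8}  range {lo:02X}h..{hi:02X}h")
```

Offsets +00h..+03h come out constant and +04h, +05h variable; with real data, more samples make the ranges more trustworthy.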
Breaking the rules
• There is a rule: RTFM (Read The F**king Manual)
• Nobody likes it
• I’m no exception
• First I start my own research, and only then try to find related works and generalize ideas from them
Definitions
• Field – a value used to describe the data format
• Structure – a way of organizing various fields
• Data – the information that the data format was developed to represent
• Header – a common structure placed before the data; may contain substructures
Thoughts
Main Idea
• A data format is developed by humans (not pets or aliens)
• Sometimes it looks as if no human worked on it…
• Format developers use common data organization concepts and similar reasoning when creating new data formats
• If we find regularities in how data formats are organized, we can automate the search for them
Three types of data format
• File formats – multimedia formats, database formats, internal formats for exchange between program components, etc.
• Protocols – network protocols, hardware device interaction protocols, protocols between a driver and a user-space application, etc.
• Structures in memory – OS structures, application structures, etc.
Levels of abstraction
• Bit Order
• Byte Order
• Field Size
• Field Basic Type
• Field Type
• Structure
• Field Semantics
Tasks to solve
• Extract the header, separate it from the data
• Find field boundaries
• Find structures and substructures
• Find types of fields
• Detect bit and byte ordering
• Determine the semantics of fields, that is, interpret the purpose of each field
Bit order
• There is a most significant bit (MSB) and a least significant bit (LSB)
• MSB as the left-most bit (the common convention) – the bits 10010101 mean 10010101b = 95h = 149
• LSB as the left-most bit – the same bits mean 10101001b = A9h = 169
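The two readings of the same bit string can be checked with a tiny sketch. The helper name `bits_to_int` is made up for this illustration.

```python
def bits_to_int(bits, msb_first=True):
    """Interpret a string of '0'/'1' characters; reverse it first if the left-most bit is the LSB."""
    if not msb_first:
        bits = bits[::-1]
    return int(bits, 2)

bits = "10010101"
msb_value = bits_to_int(bits, msb_first=True)
lsb_value = bits_to_int(bits, msb_first=False)
print(f"{msb_value:02X}h = {msb_value}")  # 95h = 149
print(f"{lsb_value:02X}h = {lsb_value}")  # A9h = 169
```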
Byte order
• Little-endian – Intel byte order
• Big-endian – network byte order
• Middle-endian or mixed-endian – mixed byte order, which appears when a value is longer than the processor's default machine word
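A short sketch of the three byte orders applied to the same four bytes. The mixed case assumes one specific layout, the PDP-11 style in which a 32-bit value is stored as two 16-bit words (high word first, each word little-endian); other mixed layouts exist, and `from_pdp_endian` is a name chosen for this example.

```python
raw = bytes([0x0A, 0x0B, 0x0C, 0x0D])

little = int.from_bytes(raw, "little")  # least significant byte first
big = int.from_bytes(raw, "big")        # most significant byte first

def from_pdp_endian(b):
    """Assumed layout: two 16-bit little-endian words, most significant word first."""
    high_word = int.from_bytes(b[0:2], "little")
    low_word = int.from_bytes(b[2:4], "little")
    return (high_word << 16) | low_word

mixed = from_pdp_endian(raw)

print(f"little-endian: {little:08X}h")  # 0D0C0B0Ah
print(f"big-endian:    {big:08X}h")     # 0A0B0C0Dh
print(f"mixed-endian:  {mixed:08X}h")   # 0B0A0D0Ch
```

When the byte order is unknown, trying each reading and checking which one yields plausible values (small sizes, offsets inside the file) is a common first step.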
Fields
Types of fields
• Service fields – describe the data format itself (size of a structure, etc.)
• Common fields – “fields from life” (time, date, etc.)
• Specific fields – fields whose range of values can be determined for their type (bit flags, etc.)
• Application-specific fields – can be interpreted only by the application, so dynamic analysis is needed
Field Size and Levels of field interpretation
• Commonly a field is a byte sequence
• Sometimes a field is a bit sequence
• Fixed size in bytes (1, 2, 3, 4, …)
• Fixed size in bits (1, 2, 3, 4, …)
• Variable size in bytes
• Variable size in bits
Basic field types
• Hex value – by default
• Decimal value
• Character value (up to 4 symbols)
• String (ASCII or Unicode)
• Float value
• Bits value
Types of field
• Text
• Offset
• Size
• Quantity
• ID
• Flag
• Counter
• DateTime
• Protect
Text
• Usually fixed size, with the remaining space after the text filled with 0
• Variable-sized text is terminated with 0, or a separate field holds the size of the text
• Text strings usually reveal their own meaning
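Both text conventions are easy to probe. The sketch below is an illustration with hypothetical helper names and a made-up record: `read_fixed_text` cuts a zero-padded fixed-size field, and `ascii_strings` lists printable ASCII runs in the way the classic `strings` utility does.

```python
import re

def read_fixed_text(data, offset, size):
    """Read a fixed-size text field and drop the zero padding after the text."""
    field = data[offset:offset + size]
    return field.split(b"\x00", 1)[0].decode("ascii", errors="replace")

def ascii_strings(data, min_len=4):
    """Return (offset, text) for each run of printable ASCII at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]

# Hypothetical record: 2 binary bytes, a 16-byte padded name, a zero-terminated name
record = b"\x01\x02LoadLibraryA\x00\x00\x00\x00name.txt\x00\xff"

print(read_fixed_text(record, 2, 16))  # LoadLibraryA
print(ascii_strings(record))           # [(2, 'LoadLibraryA'), (18, 'name.txt')]
```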
Offset
• Offset to another field
• Offset to data
• Pointer to another structure (may be child or parent)
• Pointer to another instance of the same structure (next or previous in a linked list)
• Can be relative or absolute
Export in PE-EXE
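[Diagram: locating an exported function in a loaded PE module]
• Module ImageBase points to IMAGE_DOS_HEADER; its e_lfanew field (DWORD, +03Ch) + ImageBase gives IMAGE_NT_HEADERS
• IMAGE_NT_HEADERS contains Signature (DWORD), FileHeader (IMAGE_FILE_HEADER) and OptionalHeader (IMAGE_OPTIONAL_HEADER); IMAGE_DIRECTORY_ENTRY_EXPORT (+078h) + ImageBase gives IMAGE_EXPORT_DIRECTORY
• IMAGE_EXPORT_DIRECTORY contains AddressOfFunctions (+01Ch), AddressOfNames (+020h) and AddressOfNameOrdinals (+024h); each + ImageBase gives the corresponding array
• Names (Offsets), element size dword: Offset FuncName X, Offset FuncName X+1, Offset FuncName X+2, ...; each + ImageBase gives an entry in Function Names: ..., ‘LoadLibraryA’,0, ‘LoadLibraryExA’,0, ‘LoadLibraryExW’,0, ...
• Name Ordinals, element size word: Index FuncAddr X, Index FuncAddr X+1, Index FuncAddr X+2, ...
• Functions Addresses, element size dword: FuncAddr X, FuncAddr X+1, FuncAddr X+2, ...; each + ImageBase gives a function entry point
• Find ‘LoadLibraryExA’: its index in Names selects the element with the same index in Name Ordinals; that element is the index into Functions Addresses; FuncAddr + ImageBase gives LoadLibraryExA EP: mov edi,edi / push ebp / ...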
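The chain of offsets in the PE export lookup can be followed in code. This is a minimal sketch under stated assumptions: the image is PE32 (the +078h position of the export entry differs for PE32+), `img` holds the module as mapped in memory so that an RVA is simply an index into the buffer, and forwarded exports and the ordinal base are ignored. The helper names (`u16`, `u32`, `cstring`, `find_export`) are chosen for this example.

```python
import struct

def u16(img, off):
    return struct.unpack_from("<H", img, off)[0]

def u32(img, off):
    return struct.unpack_from("<I", img, off)[0]

def cstring(img, off):
    """Read a zero-terminated ASCII string starting at off."""
    end = img.index(b"\x00", off)
    return img[off:end].decode("ascii")

def find_export(img, wanted):
    """Return the RVA of the export named `wanted`, or None if it is not exported by name."""
    nt = u32(img, 0x3C)                 # IMAGE_DOS_HEADER.e_lfanew
    if img[nt:nt + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    exp = u32(img, nt + 0x78)           # RVA of IMAGE_EXPORT_DIRECTORY (PE32 layout)
    number_of_names = u32(img, exp + 0x18)
    functions = u32(img, exp + 0x1C)    # AddressOfFunctions
    names = u32(img, exp + 0x20)        # AddressOfNames
    ordinals = u32(img, exp + 0x24)     # AddressOfNameOrdinals
    for i in range(number_of_names):
        name_rva = u32(img, names + 4 * i)       # dword elements
        if cstring(img, name_rva) == wanted:
            index = u16(img, ordinals + 2 * i)   # word elements
            return u32(img, functions + 4 * index)
    return None
```

With a real module dumped from memory, `find_export(img, "LoadLibraryExA")` would return the function's RVA, and adding ImageBase gives the entry point. For a file on disk, each RVA would first have to be translated to a file offset through the section table, which this sketch does not do.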
Size
• Size of data
• Size of structure
• Size of substructure
• Size of field
• Can be expressed not only in bytes, but also in bits, words (2 bytes), double words, paragraphs (16 bytes), etc.
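One way to spot size fields is to look for values that match a length we can measure ourselves. The sketch below is a heuristic illustration, not a complete method: it only tries 2- and 4-byte fields counted in bytes, and the function name and sample blob are hypothetical.

```python
def size_field_candidates(data):
    """Return (offset, width, byteorder, meaning) for values that match a measurable length."""
    total = len(data)
    hits = []
    for width in (2, 4):
        for off in range(total - width + 1):
            for order in ("little", "big"):
                value = int.from_bytes(data[off:off + width], order)
                if value == 0:
                    continue  # zero matches too easily to be useful
                if value == total:
                    hits.append((off, width, order, "total size"))
                elif value == total - (off + width):
                    hits.append((off, width, order, "size of what follows"))
    return hits

# Hypothetical blob: 4-byte ID, 4-byte little-endian payload size, 8-byte payload
blob = b"FMT1" + (8).to_bytes(4, "little") + b"ABCDEFGH"
print(size_field_candidates(blob))  # [(4, 4, 'little', 'size of what follows')]
```

Sizes counted in words or paragraphs could be covered by also comparing `value * 2` or `value * 16`; that extension is left out to keep the sketch short.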
Quantity
• Quantity of substructures
• Quantity of elements
• IMAGE_NT_HEADERS.IMAGE_FILE_HEADER.NumberOfSections
• IMAGE_EXPORT_DIRECTORY.NumberOfFunctions
ID
• Identifier of the data format, often called a magic number
• ID of structure
• ID of substructure
• Often the ID is a value made of character symbols
• Often the ID looks like “magic” – BE BA FE CA
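A small sketch of how an ID candidate might be classified: printable bytes are shown as text, anything else as a number in both byte orders. `describe_id` is a hypothetical helper for this illustration.

```python
def describe_id(raw):
    """Describe an ID candidate as text if printable, else as a hex number in both byte orders."""
    if all(0x20 <= b <= 0x7E for b in raw):
        return "text ID " + repr(raw.decode("ascii"))
    digits = 2 * len(raw)
    little = int.from_bytes(raw, "little")
    big = int.from_bytes(raw, "big")
    return f"numeric ID: {little:0{digits}X}h little-endian, {big:0{digits}X}h big-endian"

print(describe_id(b"MZ"))                          # text ID 'MZ'
print(describe_id(bytes.fromhex("BE BA FE CA")))
# numeric ID: CAFEBABEh little-endian, BEBAFECAh big-endian
```

The bytes BE BA FE CA only read as the "magic" word CAFEBABE when taken as a little-endian dword, so recognizing an ID can also hint at the byte order.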
Flag
• Bit flags
• Enum values used as flags
• Special case – Bool value
Counter
• Increment (possibly decrement)
• Starts with 0 / begins with another value
• Changes by 1 or another value
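Given samples captured in order (for example consecutive packets), a counter shows up as a field that changes by the same non-zero step every time. The following is a rough sketch with hypothetical names and data; it ignores wrap-around and needs at least three samples to say anything useful.

```python
def find_counters(samples, width=1, order="little"):
    """Return (offset, start, step) for fields that change by a constant non-zero step."""
    length = min(len(s) for s in samples)
    found = []
    for off in range(length - width + 1):
        values = [int.from_bytes(s[off:off + width], order) for s in samples]
        steps = {b - a for a, b in zip(values, values[1:])}
        if len(steps) == 1:
            step = steps.pop()
            if step != 0:  # step 0 is a constant field, not a counter
                found.append((off, values[0], step))
    return found

# Hypothetical packets: constant ID byte, counter byte, unrelated data byte
packets = [b"\xAA\x05\x10", b"\xAA\x06\x77", b"\xAA\x07\x03"]
print(find_counters(packets))  # [(1, 5, 1)]
```

A negative step in the result would indicate a decrementing counter.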
DateTime
• Storage format
• Resolution
• Starting moment (epoch)
• Range
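The four properties become concrete when one number is decoded under different assumptions. The sketch compares two widely used conventions chosen for this example: Unix time (seconds since 1970-01-01 UTC) and Windows FILETIME (100-nanosecond ticks since 1601-01-01 UTC). A reading that lands in an implausible year can usually be discarded.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def as_unix_seconds(value):
    """Resolution: 1 second, epoch: 1970-01-01 UTC."""
    return UNIX_EPOCH + timedelta(seconds=value)

def as_filetime(value):
    """Resolution: 100 ns (truncated to microseconds here), epoch: 1601-01-01 UTC."""
    return FILETIME_EPOCH + timedelta(microseconds=value // 10)

raw = 1383782400
print(as_unix_seconds(raw))  # 2013-11-07 00:00:00+00:00
print(as_filetime(raw))      # a moment in January 1601, so this reading is implausible
```

The range follows from width and resolution: a signed 32-bit count of seconds from 1970, for instance, runs out in 2038.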
Protect
• CRC value of data
• CRC value of substructure
• Various hash functions, etc.
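A protection field can sometimes be found by computing a checksum ourselves and looking for the result in the data. The sketch below assumes the standard CRC-32 provided by Python's `zlib.crc32` and checks only two coverage conventions (everything before the field, everything after it); real formats may use other polynomials or cover other regions. The function name and sample blob are hypothetical.

```python
import zlib

def find_crc32_fields(data):
    """Return (offset, byteorder, covered_region) where 4 bytes equal the CRC-32 of a region."""
    hits = []
    for off in range(len(data) - 3):
        field = data[off:off + 4]
        regions = {
            "bytes before field": data[:off],
            "bytes after field": data[off + 4:],
        }
        for label, region in regions.items():
            if not region:
                continue
            crc = zlib.crc32(region)
            for order in ("little", "big"):
                if field == crc.to_bytes(4, order):
                    hits.append((off, order, label))
    return hits

# Hypothetical blob: 2-byte ID, big-endian CRC-32 of the payload, payload
payload = b"hello, data format"
blob = b"HD" + zlib.crc32(payload).to_bytes(4, "big") + payload
print(find_crc32_fields(blob))
```

The scan recomputes the CRC for every offset, which is fine for small samples but slow for large files.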
Related works
Projects, articles and authors
• Protocol Informatics Project – based on “Network Protocol Analysis using Bioinformatics Algorithms” by Marshall A. Beddoe
• Discoverer – “Discoverer: Automatic Protocol Reverse Engineering from Network Traces” by Weidong Cui, Jayanthkumar Kannan, Helen J. Wang
• Laika – “Digging For Data Structures” by Anthony Cozzie, Frank Stratton, Hui Xue, and Samuel T. King
• RolePlayer – “Protocol-Independent Adaptive Replay of Application Dialog” by Weidong Cui, Vern Paxson, Nicholas C. Weaver, Randy H. Katz
• Tupni – “Tupni: Automatic Reverse Engineering of Input Formats” by Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, Luis Irun-Briz
Protocol Informatics Project
• Global and local sequence alignment – Needleman-Wunsch and Smith-Waterman algorithms
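To show what global sequence alignment does with two messages, here is a textbook Needleman-Wunsch sketch over byte strings. It is not the Protocol Informatics Project's code, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example. Gaps in the result are represented by `None`.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align byte strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]

best, row_a, row_b = needleman_wunsch(b"ABCD", b"ABD")
print(best)   # 2
print(row_a)  # [65, 66, 67, 68]
print(row_b)  # [65, 66, None, 68]
```

Columns where all aligned messages agree are candidates for constant fields, and columns with gaps point to variable-length parts. Smith-Waterman differs mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell, which yields a local alignment.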
Discoverer
Laika
• Using Bayesian unsupervised learning
• For fixed size structures only
Tupni
• Based on taint tracking engine
Applications
• RE any program
• RE undocumented/proprietary file formats
• RE undocumented/proprietary network protocols
• RE undocumented structures in memory
• Fuzzing
• Examination of protocol implementation
• Replay network interaction
• Zero-day vulnerability signature generation