trace surfing - ekoparty · trace surfing a tale of data structure recovering and other yerbas by...

Trace Surfing: A tale of data structure recovering and other yerbas

By Agustin Gianni – Immunity Inc.

Problem Statement

Given a memory trace, what information does the trace give us about the underlying data structures?

Road map

● Investigation of previous approaches
● Realization that they kind of suck
● Enlightenment phase → how can we improve

Introduction

● What is a memory trace?
  ● A memory trace is a collection of all the memory accesses performed by an application.
    – Both reads and writes
● How can I obtain a memory trace?
  ● Binary Instrumentation
    – pintool
    – DynamoRIO
  ● Full system emulation
    – QEMU
    – BOCHS

Example Memory Trace

# White listed image `calc.exe`
# Loading hooks from file hooks.hks
# Loaded hook alloc:test_custom_alloc:00000774:0:my_alloc_
# Loaded hook free:test_custom_alloc:000007b6:0:my_free_
L:calc.exe:0x003a0000:0x0045ffff
# Thread 0x0 started
# Instrumented malloc at 0x75619cee
# Instrumented free at 0x75619894
# Instrumented realloc at 0x7561b10d
# Instrumented calloc at 0x7561c456
W:0x003a76c6:0x01d125e0:0x01d125e0:0x00000004:0x0000000f
W:0x003a76cc:0x01d125e0:0x01d125e4:0x00000004:0x0000000f
…
F:0x003b8f9a
I:0x003b8f9a:0x00000031:0x00000000
F:0x003b8fdc
I:0x003b8fdc:0x00000031:0x00000000
# Thread 0x1 did not finish but application exited.
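The slides do not document the record layout, so the following is only a sketch of how the `W:` lines could be loaded for later analysis. The field names (`pc`, `chunk`, `address`, `size`, `value`) are guesses read off the two sample records, and the `R` record kind is assumed by analogy with `W`; neither is confirmed by the talk.

```python
from typing import NamedTuple, Optional


class MemAccess(NamedTuple):
    kind: str     # "R" or "W" ("R" is assumed, only "W" appears in the sample)
    pc: int       # assumed: address of the accessing instruction
    chunk: int    # assumed: base address of the tracked chunk
    address: int  # assumed: address actually accessed
    size: int     # assumed: access width in bytes
    value: int    # assumed: value read or written


def parse_access(line: str) -> Optional[MemAccess]:
    """Return a MemAccess for R:/W: records and None for every other line."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # comment or blank line
    fields = line.split(":")
    if fields[0] not in ("R", "W") or len(fields) != 6:
        return None  # other record kinds (L:, F:, I:) are ignored here
    pc, chunk, address, size, value = (int(field, 16) for field in fields[1:])
    return MemAccess(fields[0], pc, chunk, address, size, value)


sample = "W:0x003a76cc:0x01d125e0:0x01d125e4:0x00000004:0x0000000f"
access = parse_access(sample)
```

Under that reading, `access.address - access.chunk` gives the offset of the write inside its chunk, which is the quantity the later field-recovery steps work with.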

Introduction

● Why do we care about recovering data structures?
  ● Large binaries are a pain to reverse
    – Especially Object Oriented Code
      ● Virtual Function Tables and friends
  ● Makes reverse engineering happier
  ● Saves time
● Why not?
  ● Computers got fast enough to trace every single memory access

HexRays – With data types

Introduction

● Has anyone approached the problem?
  ● Dynamic analysis
    – Howard: Dynamic Excavator for Reverse Engineering Data Structures
    – Rewards: DDE, Dynamic Data Structure Excavation
  ● Static analysis
    – WYSINWYX: What You See Is Not What You eXecute
      ● Based on abstract interpretation, blah, blah, blah!

The Rewards / Howard approach

● Trace every single memory access
  ● Heap
  ● Stack
● Define type sinks
  ● System Calls
  ● Library Calls
  ● Special purpose instructions
    – For instance, string manipulation instructions on Intel architecture.
● Propagate recovered types
● Analyze the memory trace

Type Sinks

● A type sink is a function, syscall or instruction for which we know the types it takes
● System calls and standard libraries are the most verbose
  ● For instance:
    – ssize_t read(int fd, void *buf, size_t count);
    – Leaks four types: ssize_t, int, void *, size_t
    – Also we can extract semantics
      ● We know that 'fd' is a descriptor
      ● 'buf' is a buffer
      ● Etc.

Type Sinks

● Instructions can also leak types
  ● Intel String Operations
    – CMPS, INS, LODS, MOVS, OUTS, SCAS, STOS
  ● Intel Floating Point Instructions
    – FADD, FDIV, FMUL, and so on.
  ● Jumps
    – JG / JL → Signed Integers
    – JA / JB → Unsigned Integers
  ● Memory dereferences
    – Data dereferences leak half a type
      ● We just know the dereferenced address is a pointer
    – Indirect calls leak function pointer types
      ● We know that the dereferenced address contains a pointer to a function.
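The instruction-level sinks listed above amount to a lookup table. A minimal sketch, assuming the tracer hands us bare mnemonics without size suffixes; the hint strings and the `type_hint` helper are illustrative names, not part of the original tool:

```python
# Mnemonics taken from the slide; only exact matches are handled.
STRING_OPS = {"cmps", "ins", "lods", "movs", "outs", "scas", "stos"}
FLOAT_OPS = {"fadd", "fdiv", "fmul"}
SIGNED_JUMPS = {"jg", "jl"}
UNSIGNED_JUMPS = {"ja", "jb"}


def type_hint(mnemonic):
    """Return a coarse type hint for an instruction, or None if it leaks nothing."""
    name = mnemonic.lower()
    if name in STRING_OPS:
        return "string buffer"
    if name in FLOAT_OPS:
        return "floating point"
    if name in SIGNED_JUMPS:
        return "signed integer"
    if name in UNSIGNED_JUMPS:
        return "unsigned integer"
    return None
```

A real disassembler reports suffixed forms such as `movsb` or `stosd`, so the matching would need to be loosened before this is usable on an actual trace.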

What do we want to recognize?

● Things to recognize:
  ● Structures / Classes
  ● Arrays
  ● Pointers
● How?
  ● Study how the memory is accessed

Identifying Pointers

● Pointers are 'easy' to detect
  ● Just see what instructions dereference memory
  ● The dereferenced argument must be a valid pointer
    – Otherwise the program would crash
  ● Problem
    – We cannot yet know the type of the pointer
    – If we are lucky enough, and by lucky I mean that we have sufficient code coverage, we will identify the type of the pointer.

Warning: we are entering the terrain of incomplete and unsound assumptions.

Absolute correctness

● Do we really care about absolute correctness?
  ● Hint → I don't
  ● Even if we could automatically identify only a fraction of the types correctly, that saves us work.
  ● Eventually decisions/corrections must be made
● Inconsistent typing is detected by humans
● We are not aiming to solve unsolvable problems
  ● We cannot get back what is not there
    – Compilation is not bidirectional
      ● Although Rolf may argue this I've been told ;)

Identifying Structures

● Typically structure fields are accessed in an indirect way
  ● This depends heavily on the compiler and the optimization level.
  ● Often, access patterns will be similar.
● Example
  ● Let A be a base pointer
  ● *(A + 0) is the first field
  ● *(A + 8) is the second field
  ● And so on


● What we want to do is to detect indirect memory accesses.
  ● We can obtain this from a memory trace
● But …
  ● What if A was not a structure
    – Let A be an array
    – *(A + 0) is the first element
    – *(A + 8) is the second element
    – And … we are screwed
  ● Also, sometimes structure fields are accessed directly
    – There is no base pointer


● There is no way we can decide, with certainty, whether a pointer points to a structure or an array
  ● We have to make unsound assumptions
  ● Rely on compiler specific constructs
  ● Heuristics
  ● And why not a bit of magic
● In the end, manual work needs to be done
  ● Still, less work than reversing manually


● To distinguish between arrays and structures we use some heuristics
  ● Memory accesses are generally scattered
    – Example:
      ● Access field at offset 0x00
      ● Then offset 0x10
      ● And so on
  ● Size of the access is generally heterogeneous
    – Example:
      ● Access field 2 which is an integer
      ● Then access field 3 which is a short integer
      ● Etc.

Identifying Structures - Example

● Memory accesses
  ● 1 – DWORD
  ● 2 – DWORD
  ● 3 – WORD
  ● 4 – WORD
  ● 5 – DWORD
  ● 6 – BYTE
  ● 7 – WORD

[Figure: the numbered accesses mapped onto the structure's memory layout]


● There is a considerable number of cases where this will fail
  ● The most trivial cases
    – Initializing a structure with “memset”
    – Copying a structure with “memcpy”
  ● How do we solve this?
    – If we have more than one access pattern, favor the most irregular one
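One way to make "favor the most irregular pattern" concrete is to score each observed pattern by how many distinct access sizes and distinct strides it contains. This is only a sketch: the scoring function, the threshold and the sample offsets are assumptions chosen for illustration, not the metric used in the talk.

```python
def irregularity(accesses):
    """Score a pattern given as a list of (offset, size) pairs.

    Uniform patterns (memset/memcpy style) have one size and one stride
    and therefore score 2; scattered, mixed-width patterns score higher.
    """
    sizes = {size for _, size in accesses}
    offsets = sorted({offset for offset, _ in accesses})
    strides = {b - a for a, b in zip(offsets, offsets[1:])}
    return len(sizes) + len(strides)


def guess_kind(patterns):
    """Classify a chunk from all the access patterns seen on it."""
    best = max(patterns, key=irregularity)  # favor the most irregular pattern
    if len({offset for offset, _ in best}) < 2:
        return "unknown"  # a single offset says nothing about layout
    return "structure" if irregularity(best) > 2 else "array"


# Hypothetical patterns: a memset-like sweep and a field-by-field initializer
# using the access widths from the example slide (offsets are made up).
memset_like = [(0, 4), (4, 4), (8, 4), (12, 4), (16, 4)]
field_like = [(0, 4), (4, 4), (8, 2), (10, 2), (12, 4), (16, 1), (18, 2)]
```

With both patterns recorded for the same chunk, the field-by-field pattern wins and the chunk is guessed to be a structure; with only the sweep available the guess falls back to an array.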

Identifying Arrays

● We can identify arrays by watching memory accesses in loops
  ● There are two cases
    – Sequential memory accesses
    – Random memory accesses

Identifying Arrays

● Sequential memory accesses
  ● Let A be a pointer
  ● We are in a loop
  ● A is dereferenced at loop cycle one.
  ● B is generated also at loop cycle one.
  ● Next iteration
  ● B is dereferenced.
  ● A is likely an array pointer
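A simplified stand-in for the A/B reasoning above: instead of tracking how B is derived from A, look at the addresses one instruction dereferences on consecutive loop iterations and check for a constant stride. The function name and the `min_iterations` cutoff are assumptions made for this sketch.

```python
def constant_stride(addresses, min_iterations=3):
    """Return the stride if one instruction walks memory in equal steps.

    `addresses` holds the addresses dereferenced by the same instruction on
    consecutive loop iterations. None means no array-like walk was seen.
    """
    if len(addresses) < min_iterations:
        return None  # too few iterations to call it a pattern
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(strides) != 1:
        return None  # steps differ, so this is not a sequential walk
    stride = strides.pop()
    return stride if stride != 0 else None
```

A non-None result is both the hint that the first address is an array pointer and an estimate of the element size.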

Identifying Arrays

● Random memory accesses
  ● If all the accesses are of the same size we have a hint that we are dealing with an array.
  ● But it could just as well be a structure.
  ● This is getting hairy.

So, where are we?

Where are we?

● Detecting whether a pointer points to an array or a structure is essentially an educated guess.
  ● We need to further “educate” ourselves
  ● We need to have stronger assumptions that we can rely on.
● Tracing stack memory accesses is tricky
  ● What about address reutilization
    – We need to tag every address with a TAG to differentiate two identical addresses accessed at different times
● Tracing all memory accesses is painfully slow
  ● We are interested in large binaries

Are we screwed then?

● Not really
  ● We need to make our analysis a little bit more specific
    ● Hence less complete
    ● But more accurate
● It is all about giving up a bit of generality for a bit more accuracy

Looking for better waves

Focus on Heap Objects

● Why?
  ● Heap objects are shared. We like data that is shared
    – It leads to good things from a vulnerability research point of view
  ● We have more information
    – “malloc” like functions give us the size of the chunks
  ● It is easier to track heap memory
    – Hook allocation routines and tag the returned memory with a unique id
    – Hook also deallocation routines to keep track of valid memory chunks

Object Oriented Code

● Objects are basically structures with methods
● Each object method needs to somehow reference its underlying object.
● Objects of a given class share a set of common characteristics
  ● Most of them come from the heap
  ● Or at least those objects with shared state information
● So if we focus on objects, the problem is a bit less complicated
  ● We are dealing with structures of known size
  ● Now the whole address space is reduced to a fraction of its size
    – Just analyze the .heap
  ● Keeping track of the life of a heap memory region is simple
    – Hook the allocation routine → The block is alive
    – Hook the free routine → The block is dead
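The alloc/free bookkeeping can be sketched as a small tracker that hands out a fresh id per allocation, so that a reused address is never confused with the chunk that lived there before. The class and method names are hypothetical; a real tool would call them from its instrumentation hooks.

```python
import bisect
from itertools import count


class HeapTracker:
    """Tracks live heap chunks and tags each allocation with a unique id."""

    def __init__(self):
        self._ids = count(1)
        self._starts = []  # sorted base addresses of live chunks
        self._live = {}    # base address -> (chunk id, size)

    def on_alloc(self, base, size):
        """Call from the allocation hook: the block is alive."""
        if base in self._live:  # missed a free: retire the stale entry
            self.on_free(base)
        chunk_id = next(self._ids)
        bisect.insort(self._starts, base)
        self._live[base] = (chunk_id, size)
        return chunk_id

    def on_free(self, base):
        """Call from the free hook: the block is dead."""
        if base in self._live:
            del self._live[base]
            self._starts.remove(base)

    def lookup(self, address):
        """Map an accessed address to (chunk id, offset), or None if untracked."""
        index = bisect.bisect_right(self._starts, address) - 1
        if index < 0:
            return None
        base = self._starts[index]
        chunk_id, size = self._live[base]
        if address < base + size:
            return chunk_id, address - base
        return None
```

Every memory access in the trace can then be passed through `lookup`; accesses that return None fall outside any live chunk and can be dropped early.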

How to detect objects?

● Not every single heap chunk is an object
● Heuristics!
  ● Take advantage of calling conventions
    – Visual Studio: will set the 'ecx' register to the 'this' pointer
    – GCC 32 bits: pushes the object as the first method argument
    – GCC 64 bits: 'rdi' is set to the 'this' pointer
  ● So we mark every tracked heap chunk that is in “ecx”, “rdi” or the first argument of a function as a possible object
  ● The object must be used inside the potential method

How to detect methods

● There is no sound way
  ● We have to trust our heuristics
    ● Which are better than most Anti-Virus heuristics :P
● We are going to miss some methods
  ● The dynamic nature of a trace makes us rely on code coverage.
● We are going to wrongly mark some functions as methods
  ● Sometimes the 'this' pointer remains spuriously in 'ecx'

So, how are we now?

We are doing better!

● We can detect “interesting objects”
  ● We know their size
  ● We know where they are being used
● What else do we need to do?
  ● Detect fields
  ● Detect relationships with other types
    – Inheritance
    – Composition

Detecting Object Fields

● We already have all heap memory accesses in our trace
  ● If the memory access is to one of our interesting objects we save the access offset and size
  ● Since we only track interesting objects the analysis is much quicker
● We can implement the algorithms used by Howard/Rewards
  ● If we have information from type sinks, we can propagate it

Detecting methods

● On each function call check if ECX points to a heap object.
  ● If true
    – Mark the chunk as interesting
    – Save the access offset for future usage
  ● Mark the function as interesting
  ● Does this function get called again with the same conditions?
    – That is, the same function gets called with a chunk of the same size as the 'this' parameter
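A rough sketch of the check described above. It assumes the trace has already been reduced to `(function_address, ecx_value)` pairs and that `chunks` maps the base address of each heap chunk to its size; both inputs, the function name and the `min_hits` cutoff are hypothetical. Chunk lifetimes are ignored here for brevity.

```python
from collections import defaultdict


def find_candidate_methods(calls, chunks, min_hits=2):
    """Return functions repeatedly entered with ECX pointing into a heap chunk.

    A function qualifies when it was called at least `min_hits` times with a
    chunk of the same size at the same 'this' offset.
    """
    # function address -> (chunk size, 'this' offset) -> number of calls
    hits = defaultdict(lambda: defaultdict(int))
    for function, ecx in calls:
        for base, size in chunks.items():
            if base <= ecx < base + size:
                hits[function][(size, ecx - base)] += 1
                break
    return {
        function
        for function, seen in hits.items()
        if max(seen.values()) >= min_hits
    }
```

Requiring a repeat with the same chunk size is what filters out most of the calls where a stale pointer merely happened to be left in the register.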

How far can we go?

How far can we go?

● With all the collected traces we can obtain quite a lot of information
  ● Class Hierarchy
  ● Virtual Function Tables
  ● Types!
  ● Bonus (not really related to type inference)
    – Code coverage information
    – Indirect branch resolution

How can we achieve this?

Virtual Function Tables

● Useful to help IDA Pro to discover more functions
● For each write to an interesting chunk
  ● Is the value written referring to .text ?
    – Is [value] also in .text?
      ● This is for sure a Virtual Function Table
    – If not, it is just a field update
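The two-step test can be sketched as below. `read_pointer` is a hypothetical callback that returns the pointer-sized value stored at an address in the traced process, or None if it cannot be read. The optional `table_range` parameter is an addition of this sketch, for toolchains that place the tables in a read-only data section rather than in `.text`.

```python
def in_range(address, bounds):
    """True if address falls inside the half-open range (start, end)."""
    start, end = bounds
    return start <= address < end


def is_vtable_write(value, read_pointer, text_range, table_range=None):
    """Decide whether a value written into a chunk looks like a vtable pointer.

    Follows the slide: the value must point into the image, and the first
    entry it points at must itself point into code.
    """
    table_range = table_range or text_range
    if not in_range(value, table_range):
        return False  # plain field update
    first_entry = read_pointer(value)
    return first_entry is not None and in_range(first_entry, text_range)
```

Each address for which this returns True is a table whose entries can be handed to the disassembler as additional function starts.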

Types

● Type reconstruction algorithm is divided in three phases
  ● First Analysis Pass (FAP)
    – Pun intended
  ● Second Analysis Pass (SAP)
  ● Third Analysis Pass (TAP)

First Analysis Pass

● For each function
  ● Get all its interesting chunks
    – That is, chunks that were passed as the 'this' argument
  ● Mark the whole chunk as a composite type
    – Set the composite type size to the size of the chunk
  ● If 'this' does not point to the first byte of the chunk, get the offset
    – Divide the composite chunk in two types at the calculated offset
  ● Repeat the process with all the methods that used the chunk and subdivide the composite type

First Analysis Pass

[Figure: a chunk with chunk_address = A and ecx_address = A + 0 maps to a composite type holding TypeA at Offset = 0]

In this case, TypeA fills the whole composite type

First Analysis Pass

[Figure: a chunk with chunk_address = A and ecx_address = A + C maps to a composite type holding TypeA followed by TypeB at Offset = C]

In this case there are two types. We recognize this because two methods were called with 'this' pointing at the same memory chunk but at a different offset.
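The subdivision shown in the two figures reduces to cutting the chunk at every distinct 'this' offset observed for it. A minimal sketch, with a hypothetical function name and a plain list of `(offset, size)` pairs standing in for the composite type:

```python
def split_composite(chunk_size, this_offsets):
    """Cut a composite type of chunk_size bytes at each observed 'this' offset.

    Returns the sub-types as (offset, size) pairs in layout order.
    """
    cuts = sorted({0, *this_offsets})
    cuts = [cut for cut in cuts if 0 <= cut < chunk_size]  # drop bogus offsets
    bounds = cuts + [chunk_size]
    return [(start, end - start) for start, end in zip(bounds, bounds[1:])]
```

With only offset 0 observed the result is a single type covering the whole chunk; adding a second method that received `A + C` yields two types split at C.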

First Analysis Pass – continued

● Add the current function to a list of methods
● For each write to the interesting chunk
  ● Add a field at the offset of the write
  ● Mark the field with the corresponding basic type according to the write size
    – For instance, a write of four bytes is marked as “uint32_t”
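A sketch of the field-creation step. Only the four-byte case comes from the slide; the other entries of the size table, the byte-array fallback and the rule of keeping the widest write seen at an offset are assumptions of this example.

```python
# Only 4 -> uint32_t is stated in the talk; the rest is extrapolated.
BASIC_TYPES = {1: "uint8_t", 2: "uint16_t", 4: "uint32_t", 8: "uint64_t"}


def fields_from_writes(writes):
    """Build {offset: type name} from (offset, size) write records."""
    widest = {}
    for offset, size in writes:
        # Assumed policy: the widest write at an offset defines the field.
        if size > widest.get(offset, 0):
            widest[offset] = size
    return {
        offset: BASIC_TYPES.get(size, f"uint8_t[{size}]")
        for offset, size in sorted(widest.items())
    }
```

For example, writes of 4 bytes at offset 0 and of 2 and then 4 bytes at offset 4 produce two `uint32_t` fields.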

First Analysis Pass – continued

● Collect a set of constraints
  ● For each chunk that was received as the 'this' argument build a map from the method address to a list of all the types created.
  ● This will later be used to build relationships between types and to merge identical types

[Figure: method_at_0xcafecafe maps to Type_A (Size = X_1), Type_B (Size = X_1) and Type_C (Size = X_2)]

Second Analysis Pass

● Merge similar types
  ● Cheat by first using the type constraints collected in the FAP phase
● How do we define similar?
  ● They have the same size
    – Equal types with differing sizes will be addressed in the third pass
  ● They have equivalent fields
    – That is, at offset O there is a type T of size S in both types
  ● They share a set of methods
    – How many?
      ● Let N be the number of methods in Type1
      ● Let M be the number of methods in Type2
      ● Let S be the number of shared methods
      ● SimilarityIndex(N,M,S) = (S / (N+M)) * 100
      ● If SimilarityIndex > SimilarityThreshold then they are similar
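The similarity test translates directly into code. In this sketch a type is a plain dict with `size`, `fields` and `methods` keys, and the default threshold of 25 is an arbitrary placeholder; the talk does not give a value. Note that because S can be at most min(N, M), the index as defined tops out at 50 for two types with identical method sets.

```python
def similarity_index(methods_a, methods_b):
    """SimilarityIndex(N, M, S) = (S / (N + M)) * 100, on sets of method addresses."""
    n, m = len(methods_a), len(methods_b)
    if n + m == 0:
        return 0.0  # no methods seen on either type
    shared = len(methods_a & methods_b)
    return shared / (n + m) * 100


def are_similar(type_a, type_b, threshold=25.0):
    """Apply the three criteria: same size, equivalent fields, shared methods."""
    return (
        type_a["size"] == type_b["size"]
        and type_a["fields"] == type_b["fields"]
        and similarity_index(type_a["methods"], type_b["methods"]) > threshold
    )
```

Since the maximum is 50 rather than 100, any threshold has to be picked on that scale.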

Third Analysis Pass

● There are types that share methods and fields but differ in size
● What is going on?
  ● There are two possible scenarios
    – Type2 inherits from Type1
      ● len(Type2) > len(Type1) most of the time
    – The type has an internal buffer
      ● This is the case of, for example, strings in some browsers

Inheritance / Composition

● A simple inheritance relationship is translated into a composition of structures


[Figure: class layouts. ClassA: Field1, Field2, Field3, Field4. ClassB: Field1, Field2, Field3.]




● Two classes of different size use the same method
  ● The bigger one is likely the child class
  ● The smaller one is likely the parent class
● This heuristic can fail
  ● Say that we have a dynamically allocated buffer inside a class
    – Rare, weird, but it can and will happen
  ● Failure will generate an extra type but the relationships between the types will still be interesting and can be detected by a human once the information is imported into IDA Pro
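The size heuristic can be sketched as a pairwise comparison over the recovered types. The input shape (a dict from type name to a `(size, methods)` pair) and the function name are assumptions; the output is a list of guessed `(parent, child)` edges that, as the slide says, may include false positives such as classes with inline buffers.

```python
from itertools import combinations


def guess_inheritance(types):
    """Guess (parent, child) pairs among types that share methods but differ in size.

    `types` maps a type name to (size, set of method addresses).
    """
    edges = []
    for (name_a, (size_a, methods_a)), (name_b, (size_b, methods_b)) in combinations(
        types.items(), 2
    ):
        if size_a == size_b or not (methods_a & methods_b):
            continue  # same size is the second pass's job; no overlap, no relation
        # The smaller type is taken as the parent, the bigger one as the child.
        parent, child = (name_a, name_b) if size_a < size_b else (name_b, name_a)
        edges.append((parent, child))
    return edges
```

The edges are meant to be reviewed by a human after import rather than trusted blindly.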

Hard example :)

[Figure: StringClass chunk layout. StringMetadata and uint32_t len are followed by the character data (AAAA…) in the same chunk]

● Example string class that will contain metadata and contents on the same chunk of memory

● Other recurring complex examples are hash tables

Increasing accuracy

Increasing accuracy

● Accuracy of our approach is directly related to code coverage
  ● The more code coverage, the more accuracy
● Increasing code coverage
  ● The “smart” way
    – We can tweak Klee (requires source code)
    – We can code our version of SAGE
      ● ???
      ● Profit
  ● The “other” way
    – Fuzz the application like a 15 year old
    – Gather a set of input files (if possible) and calculate the set of files that gets the maximum coverage
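Picking the set of input files with maximum coverage is a set-cover problem, and a greedy pass is a common approximation. The sketch below assumes coverage has already been collected as a dict from file name to the set of basic block addresses it reached; the function name is hypothetical and the talk does not say which selection method was used.

```python
def pick_inputs(coverage):
    """Greedily choose files until every covered basic block is accounted for."""
    remaining = set().union(*coverage.values()) if coverage else set()
    chosen = []
    while remaining:
        # Take the file that adds the most blocks not covered yet.
        best = max(coverage, key=lambda name: len(coverage[name] & remaining))
        gained = coverage[best] & remaining
        if not gained:
            break
        chosen.append(best)
        remaining -= gained
    return chosen
```

Greedy selection does not guarantee the smallest possible set, but it keeps the number of files to trace manageable.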

Static Analysis

● How can we further validate our results?
  ● Detecting calling convention
● We have collected a fair amount of information, how can we propagate this information?
  ● Propagating the type information into basic blocks not executed in the trace
  ● Or we can be lazy and let the HexRays decompiler do it for us :)

Calling convention detection

● A spurious method call can be recorded when a non-method function is called from inside a method
● The function call can receive the 'this' pointer of the previous method call
● We avoid this case by ruling out all the function calls that do not behave as thiscall


● Given a function get its CFG
● Obtain a DAG (directed acyclic graph)
● Do a topological sort
● Assume ECX is a 'this' pointer
  ● Add it to a list of 'this' aliases
● For each instruction of each basic block, in topological order
  ● If the instruction kills any of the 'this' aliases
    ● Remove that alias from the list
    ● If the alias list is empty return “not thiscall”
  ● If the instruction aliases one of the 'this' pointers
    ● Add the new alias to the list
  ● If the instruction accesses memory using one of the aliases of 'this' then the function is likely 'thiscall'
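The alias-tracking walk can be sketched over a toy instruction model. Everything about the model is an assumption made for illustration: instructions are tuples of the form `("mov", dst, src)`, `("access", base_register)` or `("clobber", register)`, and `predecessors` describes the CFG with back edges already removed. The sketch uses `graphlib` from the standard library (Python 3.9 or later) for the topological sort and, like the slide, keeps a single alias list rather than one per path.

```python
from graphlib import TopologicalSorter


def looks_like_thiscall(blocks, predecessors, this_reg="ecx"):
    """Return True if the function dereferences ECX (or a copy of it) before losing it.

    blocks:       dict block id -> list of instruction tuples (toy model)
    predecessors: dict block id -> set of predecessor block ids (acyclic)
    """
    graph = {block: predecessors.get(block, ()) for block in blocks}
    aliases = {this_reg}
    for block in TopologicalSorter(graph).static_order():
        for insn in blocks.get(block, []):
            kind = insn[0]
            if kind == "access":
                if insn[1] in aliases:
                    return True  # memory accessed through 'this'
            elif kind == "mov":
                _, dst, src = insn
                if src in aliases:
                    aliases.add(dst)      # new alias of 'this'
                else:
                    aliases.discard(dst)  # alias overwritten
            elif kind == "clobber":
                aliases.discard(insn[1])
            if not aliases:
                return False  # every alias was killed before any use
    return False
```

A function that copies ECX to another register and dereferences the copy is accepted; one that overwrites ECX before touching it is rejected, which is the case that filters out stale 'this' pointers.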


● This can fail too
● Generally it gives a correct answer in 90% of the analyzed functions
  ● These results were validated by analyzing binaries with symbols available
● In practice this information allows us to detect spurious functions detected as methods of a class

Example: calc.exe types

References

● http://www.pintool.org/

● http://www.dynamorio.org/

● http://wiki.qemu.org/Main_Page

● http://bochs.sourceforge.net/

● http://www.few.vu.nl/~asia/publications

● http://www.cs.purdue.edu/homes/xyzhang/reverse.html

● http://pages.cs.wisc.edu/~reps/

Thanks to

● Juliano Rizzo
● Nicolas Waisman
● Pablo Sole
● Sean Heelan
● Topo Muñiz
