Trace Surfing: A tale of data structure recovering and other yerbas
ekoparty
By Agustin Gianni – Immunity Inc.
Problem Statement
Given a memory trace, what information does the trace give us about the underlying data structures?
Road map
● Investigation of previous approaches
● Realization that they kind of suck
● Enlightenment phase → how can we improve
Introduction
● What is a memory trace?
  ● A memory trace is a collection of all the memory accesses performed by an application.
    – Both reads and writes
● How can I obtain a memory trace?
  ● Binary instrumentation
    – pintool
    – DynamoRIO
  ● Full system emulation
    – QEMU
    – BOCHS
Example Memory Trace
# White listed image `calc.exe`
# Loading hooks from file hooks.hks
# Loaded hook alloc:test_custom_alloc:00000774:0:my_alloc_
# Loaded hook free:test_custom_alloc:000007b6:0:my_free_
L:calc.exe:0x003a0000:0x0045ffff
# Thread 0x0 started
# Instrumented malloc at 0x75619cee
# Instrumented free at 0x75619894
# Instrumented realloc at 0x7561b10d
# Instrumented calloc at 0x7561c456
W:0x003a76c6:0x01d125e0:0x01d125e0:0x00000004:0x0000000f
W:0x003a76cc:0x01d125e0:0x01d125e4:0x00000004:0x0000000f
…
F:0x003b8f9a
I:0x003b8f9a:0x00000031:0x00000000
F:0x003b8fdc
I:0x003b8fdc:0x00000031:0x00000000
# Thread 0x1 did not finish but application exited.
Introduction
● Why do we care about recovering data structures?
  ● Large binaries are a pain to reverse
    – Especially object-oriented code
      ● Virtual function tables and friends
  ● Makes reverse engineering happier
  ● Saves time
● Why not?
  ● Computers got fast enough to trace every single memory access
HexRays – With data types
Introduction
● Has anyone approached the problem?
  ● Dynamic analysis
    – Howard: Dynamic Excavator for Reverse Engineering Data Structures
    – Rewards: DDE, Dynamic Data Structure Excavation
  ● Static analysis
    – WYSINWYX: What You See Is Not What You eXecute
      ● Based on abstract interpretation, blah, blah, blah!
The Rewards / Howard approach
● Trace every single memory access
  ● Heap
  ● Stack
● Define type sinks
  ● System calls
  ● Library calls
  ● Special-purpose instructions
    – For instance, string manipulation instructions on Intel architecture.
● Propagate recovered types
● Analyze the memory trace
Type Sinks
● A type sink is a function, syscall or instruction whose argument types we know
● System calls and standard libraries are the most informative
  ● For instance:
    – ssize_t read(int fd, void *buf, size_t count);
    – Leaks four types: ssize_t, int, void *, size_t
    – Also we can extract semantics
      ● We know that 'fd' is a descriptor
      ● 'buf' is a buffer
      ● Etc.
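A type sink can be pictured as a lookup table from a known function to the types of its arguments and return value. The sketch below is only an illustration: the table layout and the helper name `types_leaked_by_call` are assumptions, and only the `read` prototype comes from the slide.

```python
# Hypothetical sink table. Each argument is (name, C type, semantic tag).
# Only the read() prototype is taken from the slide.
TYPE_SINKS = {
    "read": {
        "ret": ("ssize_t", "byte count or error"),
        "args": [
            ("fd", "int", "file descriptor"),
            ("buf", "void *", "buffer"),
            ("count", "size_t", "length"),
        ],
    },
}


def types_leaked_by_call(name, arg_values, ret_value):
    """Pair every concrete value seen at a sink call with its known type."""
    sink = TYPE_SINKS.get(name)
    if sink is None:
        return []  # not a sink: the call tells us nothing
    leaked = [
        (arg_name, value, ctype, meaning)
        for (arg_name, ctype, meaning), value in zip(sink["args"], arg_values)
    ]
    leaked.append(("return", ret_value) + sink["ret"])
    return leaked


leaked = types_leaked_by_call("read", [3, 0x01D125E0, 0x100], 0x80)
```

Each tuple in `leaked` can then be propagated to every other place in the trace where the same value or location shows up.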
Type Sinks
● Instructions can also leak types
  ● Intel string operations
    – CMPS, INS, LODS, MOVS, OUTS, SCAS, STOS
  ● Intel floating point instructions
    – FADD, FDIV, FMUL, and so on.
  ● Jumps
    – JG / JL → signed integers
    – JA / JB → unsigned integers
  ● Memory dereferences
    – Data dereferences leak half a type
      ● We just know the dereferenced address is a pointer
    – Indirect calls leak function pointer types
      ● We know that the dereferenced address contains a pointer to a function.
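The instruction families listed above can be turned into a small mnemonic-to-hint lookup. This is a minimal sketch covering only the mnemonics named on the slide; the function name `type_hint` and the hint strings are made up for the example.

```python
# Instruction families named on the slide, lower-cased.
STRING_OPS = {"cmps", "ins", "lods", "movs", "outs", "scas", "stos"}
FLOAT_OPS = {"fadd", "fdiv", "fmul"}
SIGNED_JUMPS = {"jg", "jl"}
UNSIGNED_JUMPS = {"ja", "jb"}


def type_hint(mnemonic):
    """Return a coarse type hint for an instruction, or None."""
    m = mnemonic.lower()
    # String instructions usually carry a size suffix (movsb, stosd, ...).
    # Caveat: "movsd" and "cmpsd" also name SSE2 instructions, so a real
    # tool would have to look at the operands before trusting this.
    if m[-1:] in "bwdq" and m[:-1] in STRING_OPS:
        m = m[:-1]
    if m in STRING_OPS:
        return "string buffer"
    if m in FLOAT_OPS:
        return "floating point"
    if m in SIGNED_JUMPS:
        return "signed integer"
    if m in UNSIGNED_JUMPS:
        return "unsigned integer"
    return None
```

For jumps, the hint applies to the operands of the comparison that set the flags, not to the jump itself.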
What do we want to recognize?
● Things to recognize:
  ● Structures / Classes
  ● Arrays
  ● Pointers
● How?
  ● Study how the memory is accessed
Identifying Pointers
● Pointers are 'easy' to detect
  ● Just see what instructions dereference memory
  ● The dereferenced argument must be a valid pointer
    – Otherwise the program would crash
● Problem
  – We cannot yet know the type of the pointer
  – If we are lucky enough, and by lucky I mean that we have sufficient code coverage, we will identify the type of the pointer.
Warning : we are entering the terrain of the incomplete and unsound assumptions.
Absolute correctness
● Do we really care about absolute correctness?
  ● Hint → I don't
  ● Even if we could automatically identify only a fraction of the types correctly, that saves us work.
  ● Eventually decisions/corrections must be made
● Inconsistent typing is detected by humans
● We are not aiming to solve unsolvable problems
  ● We cannot get back what is not there
    – Compilation is not bidirectional
      ● Although Rolf may argue this, I've been told ;)
Identifying Structures
● Typically structure fields are accessed in an indirect way
  ● This depends heavily on the compiler and the optimization level.
  ● Often, access patterns will be similar.
● Example
  ● Let A be a base pointer
  ● *(A + 0) is the first field
  ● *(A + 8) is the second field
  ● And so on
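The base-plus-offset pattern can be sketched as grouping accesses by base pointer and recording the offset and width of each one. The `(base, address, size)` tuple format is an assumption made for this example, not the format of the trace shown earlier.

```python
from collections import defaultdict


def recover_layout(accesses):
    """accesses: iterable of (base, address, size) tuples.

    Returns {base: {offset: size}}, keeping the widest access seen at
    each offset.
    """
    layout = defaultdict(dict)
    for base, address, size in accesses:
        offset = address - base
        layout[base][offset] = max(size, layout[base].get(offset, 0))
    return {base: dict(sorted(fields.items())) for base, fields in layout.items()}


A = 0x01D125E0
layout = recover_layout([(A, A + 8, 4), (A, A + 0, 4), (A, A + 8, 2)])
```

Here `layout[A]` ends up with one entry at offset 0 and one at offset 8, which is the candidate field list for whatever A points to.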
Identifying Structures
● What we want to do is to detect indirect memory addresses.
  ● We can obtain this from a memory trace
● But …
  ● What if A was not a structure
    – Let A be an array
    – *(A + 0) is the first element
    – *(A + 8) is the second element
    – And … we are screwed
  ● Also, sometimes structure fields are accessed directly
    – There is no base pointer
Identifying Structures
● There is no way we can decide, with certainty, whether a pointer points to a structure or an array
  ● We have to make unsound assumptions
  ● Rely on compiler specific constructs
  ● Heuristics
  ● And why not a bit of magic
● In the end, manual work needs to be done
  ● Still, less work than reversing manually
Identifying Structures
● To distinguish between arrays and structures we use some heuristics
  ● Memory accesses are generally scattered
    – Example:
      ● Access field at offset 0x00
      ● Then offset 0x10
      ● And so on
  ● Size of the access is generally heterogeneous
    – Example:
      ● Access field 2 which is an integer
      ● Then access field 3 which is a short integer
      ● Etc.
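Both heuristics can be written as a toy predicate over the ordered list of accesses to one chunk. This is a sketch under simplifying assumptions: the `(offset, size)` input format and the rule of flagging a structure when either signal fires are illustrative choices, not the talk's actual rule.

```python
def looks_like_structure(accesses):
    """accesses: list of (offset, size) in the order they were observed."""
    sizes = {size for _, size in accesses}
    offsets = [offset for offset, _ in accesses]
    strides = {b - a for a, b in zip(offsets, offsets[1:])}
    heterogeneous_sizes = len(sizes) > 1   # mixed DWORD / WORD / BYTE accesses
    scattered = len(strides) > 1           # no single regular step between accesses
    return heterogeneous_sizes or scattered


struct_like = [(0x00, 4), (0x10, 4), (0x14, 2), (0x16, 1)]
array_like = [(0x00, 4), (0x04, 4), (0x08, 4), (0x0C, 4)]
```

A byte-by-byte sweep such as `[(0, 1), (1, 1), (2, 1)]` is classified as array-like, so a structure that is only ever filled by a bulk copy is misread by this predicate.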
Identifying Structures - Example
● Memory accesses
  ● 1 – DWORD
  ● 2 – DWORD
  ● 3 – WORD
  ● 4 – WORD
  ● 5 – DWORD
  ● 6 – BYTE
  ● 7 – WORD
[Figure: structure layout annotated with the access numbers above]
Identifying Structures
● There are a considerable number of cases where this will fail
  ● The most trivial cases
    – Initializing a structure with “memset”
    – Copying a structure with “memcpy”
  ● How do we solve this?
    – If we have more than one access pattern, favor the more irregular one
Identifying Arrays
● We can identify arrays by watching memory accesses in loops
● There are two cases
  – Sequential memory accesses
  – Random memory accesses
Identifying Arrays
● Sequential memory accesses
  ● Let A be a pointer
  ● We are in a loop
  ● A is dereferenced at loop cycle one.
  ● B is also generated at loop cycle one.
  ● Next iteration
  ● B is dereferenced.
  ● A is likely an array pointer
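One way to approximate the sequential case is a constant-stride check: if the addresses dereferenced by one instruction on successive iterations advance by a fixed step, the first address is likely an array base. This stride test is a stand-in for the A/B description, and the three-iteration minimum is an arbitrary assumption.

```python
def constant_stride(addresses):
    """Return the stride if successive dereferences advance by a fixed,
    non-zero step, else None.

    addresses: the addresses touched by one instruction on successive
    loop iterations.
    """
    if len(addresses) < 3:
        return None  # too few iterations to call it a pattern
    steps = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(steps) != 1:
        return None
    step = steps.pop()
    return step if step != 0 else None
```

The returned stride doubles as a guess for the element size of the array.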
Identifying Arrays
● Random memory accesses
  ● If all the accesses are of the same size we have a hint that we are dealing with an array.
  ● But it could just as well be a structure.
  ● This is getting hairy.
So, where are we?
Where are we?
● Detecting whether a pointer points to an array or a structure is essentially an educated guess.
  ● We need to further “educate” ourselves
  ● We need to have stronger assumptions that we can rely on.
● Tracing stack memory accesses is tricky
  ● What about address reuse?
    – We need to tag every address with a TAG to differentiate two identical addresses accessed at different times
● Tracing all memory accesses is painfully slow
  ● We are interested in large binaries
Are we screwed then?
● Not really
  ● We need to make our analysis a little bit more specific
  ● Hence less complete
  ● But more accurate
● It is all about giving up a bit of generality for a bit more accuracy
Looking for better waves
Focus on Heap Objects
● Why?
  ● Heap objects are shared. We like data that is shared
    – It leads to good things from a vulnerability research point of view
  ● We have more information
    – “malloc”-like functions give us the size of the chunks
  ● It is easier to track heap memory
    – Hook allocation routines and tag the returned memory with a unique id
    – Also hook deallocation routines to keep track of valid memory chunks
Object Oriented Code
● Objects are basically structures with methods
● Each object method needs to somehow reference its underlying object.
● Objects of a given class share a set of common characteristics
  ● Most of them come from the heap
  ● Or at least those objects with shared state information
● So if we focus on objects, the problem is a bit less complicated
  ● We are dealing with structures of known size
  ● Now the whole address space is reduced to a fraction of its size
    – Just analyze the .heap
  ● Keeping track of the life of a heap memory region is simple
    – Hook the allocation routine → the block is alive
    – Hook the free routine → the block is dead
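The alloc/free bookkeeping can be sketched as a small tracker that gives every allocation a fresh id, so a reused address is never confused with the chunk that previously lived there. The class and method names are hypothetical, and the linear scan in `lookup` is for brevity only; a sorted structure would be needed for large traces.

```python
import itertools


class HeapTracker:
    """Tracks live heap chunks as reported by allocation/free hooks."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.live = {}  # base address -> (chunk id, size)

    def on_alloc(self, address, size):
        """Called from the allocation hook: the block is alive."""
        chunk_id = next(self._ids)
        self.live[address] = (chunk_id, size)
        return chunk_id

    def on_free(self, address):
        """Called from the free hook: the block is dead."""
        self.live.pop(address, None)

    def lookup(self, address):
        """Return (chunk id, offset) if address is inside a live chunk."""
        for base, (chunk_id, size) in self.live.items():
            if base <= address < base + size:
                return chunk_id, address - base
        return None


tracker = HeapTracker()
first = tracker.on_alloc(0x1000, 32)
hit = tracker.lookup(0x1008)
tracker.on_free(0x1000)
miss = tracker.lookup(0x1008)
second = tracker.on_alloc(0x1000, 32)  # same address, different chunk id
```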
How to detect objects?
● Not every single heap chunk is an object
  ● Heuristics!
● Take advantage of calling conventions
  – Visual Studio: will set the 'ecx' register to the 'this' pointer
  – GCC 32 bits: pushes the object as the first method argument
  – GCC 64 bits: 'rdi' is set to the 'this' pointer
● So we mark every tracked heap chunk that is in 'ecx', 'rdi' or the first argument of a function as a possible object
● The object must be used inside the potential method
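The calling-convention check can be sketched as two steps: pick the value that would carry 'this' under a given convention, then test whether it points into a tracked heap chunk. The convention labels and function names are invented for the example. The 64-bit case assumes the System V AMD64 ABI, where the first integer argument, and therefore 'this', is passed in rdi.

```python
def this_candidate(convention, registers, stack_args):
    """Pick the value that would hold 'this' under each convention."""
    if convention == "msvc32":
        return registers.get("ecx")
    if convention == "gcc32":
        return stack_args[0] if stack_args else None
    if convention == "sysv64":
        return registers.get("rdi")
    raise ValueError(f"unknown convention: {convention}")


def possible_object(candidate, live_chunks):
    """live_chunks: {base: size}.

    Returns the base of the chunk that contains candidate, or None if it
    does not point into tracked heap memory.
    """
    if candidate is None:
        return None
    for base, size in live_chunks.items():
        if base <= candidate < base + size:
            return base
    return None


chunks = {0x2000: 0x40}
ecx = this_candidate("msvc32", {"ecx": 0x2010}, [])
```

A non-None result only marks the chunk as a possible object; the function still has to use the pointer before it counts as a method.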
How to detect methods
● There is no sound way
  ● We have to trust our heuristics
    ● Which are better than most Anti-Virus heuristics :P
● We are going to miss some methods
  ● The dynamic nature of a trace makes us rely on code coverage.
● We are going to wrongly mark some functions as methods
  ● Sometimes the 'this' pointer remains spuriously in 'ecx'
So, how are we now?
We are doing better!
● We can detect “interesting objects”
  ● We know their size
  ● We know where they are being used
● What else do we need to do?
  ● Detect fields
  ● Detect relationships with other types
    – Inheritance
    – Composition
Detecting Object Fields
● We already have all heap memory accesses in our trace
  ● If the memory access is to one of our interesting objects we save the access offset and size
  ● Since we only track interesting objects the analysis is much quicker
● We can implement the algorithms used by Howard/Rewards
  ● If we have information from type sinks, we can propagate it
Detecting methods
● On each function call check if ECX points to a heap object.
  ● If true
    – Mark the chunk as interesting
    – Save the access offset for future usage
  ● Mark the function as interesting
  ● Does this function get called again with the same conditions?
    – That is, the same function gets called with a chunk of the same size as the 'this' parameter
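The repeated-call check can be sketched as a counter keyed on (function, chunk size): a function is promoted to a method once it has been entered more than once with ECX pointing into a chunk of the same size. The `min_hits` threshold and all names are assumptions made for this example.

```python
from collections import Counter


class MethodDetector:
    def __init__(self, min_hits=2):
        self.min_hits = min_hits
        self.hits = Counter()  # (function address, chunk size) -> call count

    def on_call(self, function, ecx, live_chunks):
        """live_chunks: {base: size}. Returns the offset of 'this' inside
        its chunk, or None if ECX does not point to tracked heap memory."""
        for base, size in live_chunks.items():
            if base <= ecx < base + size:
                self.hits[(function, size)] += 1
                return ecx - base
        return None

    def methods(self):
        """Functions seen at least min_hits times with same-sized chunks."""
        return {fn for (fn, _), n in self.hits.items() if n >= self.min_hits}


detector = MethodDetector()
chunks = {0x2000: 0x20, 0x3000: 0x20, 0x4000: 0x80}
detector.on_call(0x401000, 0x2000, chunks)
detector.on_call(0x401000, 0x3000, chunks)  # same function, same chunk size
detector.on_call(0x402000, 0x4000, chunks)  # seen only once
```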
How far can we go?
● With all the collected traces we can obtain quite a lot of information
  ● Class Hierarchy
  ● Virtual Function Tables
  ● Types!
  ● Bonus (not really related to type inference)
    – Code coverage information
    – Indirect branch resolution
How can we achieve this?
Virtual Function Tables
● Useful to help IDA Pro to discover more functions
● For each write to an interesting chunk
  ● Is the value written referring to .text?
    – Is [value] also in .text?
      ● This is for sure a Virtual Function Table
    – If not, it is just a field update
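The two-step check can be sketched as a function that follows the slide literally: the written value must fall in .text and so must the pointer stored at that value. The `read_pointer` callback and the section bounds are assumptions. Many toolchains place virtual tables in a read-only data section instead, so the first range check may need to cover more than .text.

```python
def classify_write(value, read_pointer, text_range):
    """Classify a write into an interesting chunk.

    value:        the value written into the chunk
    read_pointer: callable returning the pointer-sized value stored at an
                  address, or None if the address is unmapped
    text_range:   (start, end) of the .text section, end exclusive
    """
    start, end = text_range

    def in_text(address):
        return address is not None and start <= address < end

    if in_text(value) and in_text(read_pointer(value)):
        return "vtable pointer"
    return "field update"


text = (0x1000, 0x2000)
memory = {0x1800: 0x1100}  # the first vtable slot points at a function
```

Here `classify_write(0x1800, memory.get, text)` treats 0x1800 as a virtual table pointer, and the entries that follow it are candidate function addresses to hand to IDA Pro.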
Types
● The type reconstruction algorithm is divided into three phases
  ● First Analysis Pass (FAP)
    – Pun intended
  ● Second Analysis Pass (SAP)
  ● Third Analysis Pass (TAP)
First Analysis Pass
● For each function
  ● Get all its interesting chunks
    – That is, chunks that were passed as the 'this' argument
  ● Mark the whole chunk as a composite type
    – Set the composite type size to the size of the chunk
  ● If 'this' does not point to the first byte of the chunk, get the offset
    – Divide the composite chunk in two types at the calculated offset
  ● Repeat the process with all the methods that used the chunk and subdivide the composite type
First Analysis Pass
Chunk: chunk_address = A, ecx_address = A + 0
Composite Type: TypeA (Offset = 0)
In this case, TypeA fills the whole composite type.
First Analysis Pass
Chunk: chunk_address = A, ecx_address = A + C
Composite Type: TypeA, followed by TypeB (Offset = C)
In this case there are two types. We recognize this because two methods were called with 'this' pointing at the same memory chunk but at a different offset.
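The subdivision step can be sketched as cutting the chunk at every distinct offset at which 'this' was observed. The function name and the `(offset, size)` output format are assumptions for this example.

```python
def split_composite(chunk_size, this_offsets):
    """Return [(offset, size), ...] sub-types of a chunk, cut at every
    distinct offset at which 'this' was observed inside it."""
    cuts = sorted({0, *this_offsets})
    cuts = [c for c in cuts if 0 <= c < chunk_size]
    bounds = cuts + [chunk_size]
    return [(start, end - start) for start, end in zip(bounds, bounds[1:])]


whole = split_composite(0x40, [0])         # 'this' always at the chunk start
two = split_composite(0x40, [0, 0x10])     # a second method saw A + 0x10
```

Offset 0 is always treated as a cut, so a chunk whose only observed 'this' is at A + C still yields two types, matching the second diagram.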
First Analysis Pass – continued
● Add the current function to a list of methods
● For each write to the interesting chunk
  ● Add a field at the offset of the write
  ● Mark the field with the corresponding basic type according to the write size
    – For instance, a write of four bytes is marked as “uint32_t”
First Analysis Pass – continued
● Collect a set of constraints
  ● For each chunk that was received as the 'this' argument, build a map from the method address to a list of all the types created.
● This will later be used to build relationships between types and to merge identical types
method_at_0xcafecafe
  Type_A (Size = X_1)
  Type_B (Size = X_1)
  Type_C (Size = X_2)
Second Analysis Pass
● Merge similar types
  ● Cheat by first using the type constraints collected in the FAP phase
● How do we define similar?
  ● They have the same size
    – Equal types with differing sizes will be addressed in the third pass
  ● They have equivalent fields
    – That is, at offset O there is a type T of size S in both types
  ● They share a set of methods
    – How many?
      ● Let N be the number of methods in Type1
      ● Let M be the number of methods in Type2
      ● Let S be the number of shared methods
      ● SimilarityIndex(N,M,S) = (S / (N+M)) * 100
      ● If SimilarityIndex > SimilarityThreshold then they are similar
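The similarity index is easy to try out directly. The sketch below implements the formula as written on the slide; the threshold value is left to the caller because the slides do not give one.

```python
def similarity_index(n, m, s):
    """SimilarityIndex(N, M, S) = (S / (N + M)) * 100, as on the slide."""
    if n + m == 0:
        return 0.0
    return (s / (n + m)) * 100


def similar(methods1, methods2, threshold):
    """methods1, methods2: sets of method addresses for the two types."""
    shared = len(methods1 & methods2)
    return similarity_index(len(methods1), len(methods2), shared) > threshold


identical = similarity_index(4, 4, 4)
partial = similarity_index(3, 5, 2)
```

As written, the index tops out at 50 when both types have exactly the same methods, so any usable threshold has to sit below 50.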
Third Analysis Pass
● There are types that share methods and fields but differ in size
● What is going on?
  ● There are two possible scenarios
    – Type2 inherits from Type1
      ● len(Type2) > len(Type1) most of the time
    – The type has an internal buffer
      ● This is the case, for example, of strings in some browsers
Inheritance / Composition
● A simple inheritance relationship is translated into a composition of structures
Inheritance / Composition
ClassA: Field1, Field2, Field3, Field4
ClassB: Field1, Field2, Field3, plus an embedded ClassA (Field1, Field2, Field3, Field4)
Inheritance / Composition
● Two classes of different size use the same method
  ● The bigger one is likely the child class
  ● The smaller one is likely the parent class
● This heuristic can fail
  ● Say that we have a dynamically allocated buffer inside a class
    – Rare, weird, but it can and will happen
● Failure will generate an extra type, but the relationships between the types will still be interesting and can be detected by a human once the information is imported into IDA Pro
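The size-based guess can be sketched in a few lines. The tuple layout `(name, size, methods)` is an assumption for the example, and the result is only a hint for the human reviewer.

```python
def guess_parent_child(type_a, type_b):
    """Each type is (name, size, set of method addresses).

    Returns {'parent': ..., 'child': ...} when the two types share at
    least one method and differ in size, else None.
    """
    name_a, size_a, methods_a = type_a
    name_b, size_b, methods_b = type_b
    if not (methods_a & methods_b) or size_a == size_b:
        return None
    if size_a < size_b:
        return {"parent": name_a, "child": name_b}
    return {"parent": name_b, "child": name_a}


guess = guess_parent_child(
    ("Type_1", 0x10, {0x401000, 0x401020}),
    ("Type_2", 0x18, {0x401000, 0x401050}),
)
```

A class with a variable-length internal buffer produces the same signal, which is the failure case described above.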
Hard example :)
StringClass
  StringMetadata: uint32_t len
  Contents: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA…AAAAAAAAAAAAAAA???????????????
● Example string class that will contain metadata and contents on the same chunk of memory
● Other recurring complex examples are hash tables
Increasing accuracy
● Accuracy of our approach is directly related to code coverage
  ● The more code coverage, the more accuracy
● Increasing code coverage
  ● The “smart” way
    – We can tweak Klee (requires source code)
    – We can code our version of SAGE
      ● ???
      ● Profit
  ● The “other” way
    – Fuzz the application like a 15 year old
    – Gather a set of input files (if possible) and calculate the set of files that gets the maximum coverage
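Picking the input files that maximize coverage can be approximated with a greedy set cover: keep choosing the file that adds the most uncovered blocks. The slides do not say which method was used; the greedy choice and the input format are assumptions.

```python
def pick_inputs(coverage):
    """coverage: {file name: set of covered basic-block addresses}.

    Greedy set cover: repeatedly take the file that covers the most
    still-uncovered blocks. Ties are broken alphabetically.
    """
    remaining = set().union(*coverage.values())
    chosen = []
    while remaining:
        best = max(sorted(coverage), key=lambda name: len(coverage[name] & remaining))
        chosen.append(best)
        remaining -= coverage[best]
    return chosen


corpus = {
    "a.bin": {0x10, 0x20, 0x30},
    "b.bin": {0x30, 0x40},
    "c.bin": {0x10, 0x20},
}
selected = pick_inputs(corpus)
```

Greedy selection does not always give the smallest possible set, but it is simple and usually close enough for trimming a corpus before tracing.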
Static Analysis
● How can we further validate our results?
  ● Detecting calling convention
● We have collected a fair amount of information, how can we propagate this information?
  ● Propagating the type information into basic blocks not executed on the trace
  ● Or we can be lazy and let the HexRays decompiler do it for us :)
Calling convention detection
● A spurious method detection can happen when a non-method function is called from within a method
● The function call can receive the 'this' pointer of the previous method call
● We avoid this case by ruling out all the function calls that do not behave as thiscall
Calling convention detection
● Given a function, get its CFG
● Obtain a DAG (directed acyclic graph)
● Do a topological sort
● Assume ECX is a 'this' pointer
  ● Add it to a list of 'this' aliases
● For each basic block
  ● If an instruction kills any of the 'this' aliases, remove it from the list
    ● If the alias list is empty, return “not thiscall”
  ● If the instruction aliases one of the 'this' pointers
    ● Add the new alias to the list
  ● If the instruction accesses memory using one of the aliases of 'this', then the function is likely 'thiscall'
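The alias-tracking loop can be sketched over a deliberately simplified instruction model. Everything about the model is an assumption: instructions are dicts with optional `mem_base`, `copy` and `kills` keys, the blocks are assumed to be already in topological order, and a single alias set is shared by all paths.

```python
def looks_like_thiscall(blocks):
    """blocks: basic blocks in topological order. Each instruction is a
    dict with optional keys:
      'mem_base': register used as the base of a memory access
      'copy':     (dst, src) register-to-register move
      'kills':    register overwritten with an unrelated value
    """
    aliases = {"ecx"}  # assume ECX holds 'this' on entry
    for block in blocks:
        for insn in block:
            # A memory access through an alias is the evidence we want.
            if insn.get("mem_base") in aliases:
                return True
            if "copy" in insn:
                dst, src = insn["copy"]
                if src in aliases:
                    aliases.add(dst)      # new alias of 'this'
                else:
                    aliases.discard(dst)  # alias overwritten by something else
            if insn.get("kills") in aliases:
                aliases.discard(insn["kills"])
            if not aliases:
                return False              # 'this' died before being used
    return False


method = [[{"copy": ("esi", "ecx")}, {"kills": "ecx"}], [{"mem_base": "esi"}]]
plain = [[{"kills": "ecx"}, {"mem_base": "ecx"}]]
```

The memory-access check runs before the kill check so that an instruction which reads through ECX and overwrites it at the same time still counts as a use of 'this'.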
Calling convention detection
● This can fail too
  ● Generally it gives a correct answer in 90% of the analyzed functions
  ● These results were validated by analyzing binaries with symbols available
● In practice this information allows us to detect functions spuriously marked as methods of a class
Example: calc.exe types
References
● http://www.pintool.org/
● http://www.dynamorio.org/
● http://wiki.qemu.org/Main_Page
● http://bochs.sourceforge.net/
● http://www.few.vu.nl/~asia/publications
● http://www.cs.purdue.edu/homes/xyzhang/reverse.html
● http://pages.cs.wisc.edu/~reps/
Thanks to
● Juliano Rizzo
● Nicolas Waisman
● Pablo Sole
● Sean Heelan
● Topo Muñiz