paradyn project petascale tools workshop madison, wisconsin aug 4-aug 7, 2014 binary code is not...

30
Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Upload: ethan-cannon

Post on 02-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Paradyn Project

Petascale Tools WorkshopMadison, WisconsinAug 4-Aug 7, 2014

Binary Code is Not Easy

Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Page 2: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Three parsing stages

Binary Code is Not Easy 2

7a 77 0e 20 e9 3d e0 09 e8 68

c0 45 be 79 5e 80 89 08 27 c0

73 1c 88 48 6a d8 6a d0 56 4b

fe 92 57 af 40 0c b6 f2 64 32

f5 07 b6 66 21 0c 85 a5 94 2b

20 fd 5b 95 e7 c2 42 3d f0 2d

7a 77 0e 20 e9 3d e0 09 e8 68

c0 45 be 79 5e 80 37 1b 2f b9

xchg %eax,%ecxfdiv %st(3),%stjmp *-0xf(%esi)add %edi,%ebpjmp *-0x39(%ebp)mov 0xc(%esi),%eax

Code Discovery

CFG Construction

CFG Partitioning

foo: bar:

Binary file:

Page 3: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

3

Parsing approaches

o Linear scan

o Control flow (recursive) traversal

o Speculative disassembly

Binary Code is Not Easy

Page 4: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

4

Parsing approach: Linear scan

o Decode instructions sequentially starting from a specific pointo Easy to implemento Discovers almost all the instructionso Can confuse data with codeo Can be confused about instruction alignment

o Tools that use linear scano GNU Objdumpo IDA Proo CodeSurfer/x86o BAP

Binary Code is Not Easy

Page 5: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

5

Parsing approach: Control flow traversal

Binary Code is Not Easy

o Decodes instructions start from known function entry points and follows control transfers of the programoOnly looks at code and ignores dataoCompleteness depends on quality of

indirect control flow analysisoDepends on gap parsing to find code not

reachable by the traversalo Tools that use control flow traversal

oDyninst

Page 6: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

6

Parsing approach: Speculative disassemblyo Tries to decode instructions starting at

every byteoWill not miss any instructionoDoes not always find the actual start of an

instructionoMay confuse code with data

o Tools that use speculative disassemblyoDyninst (gap parsing only)oOllyDebugo SecondWrite

Binary Code is Not Easy

Page 7: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Difficulties of accurate parsing

Binary Code is Not Easy 7

.text

Symbol tableaddress of foo: 0x4000address of bar: 0x40404000 <foo>:…………jmp %eax

data

…………retxchg %eax %eax

4040 <bar>:…………call _abort

…………jmp bar

Code Discovery:• Code and data are mixed• Compiler may insert padding• Function symbols can be incomplete or

missing• Instructions may overlap

Binary file:

Page 8: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Difficulties of accurate parsing

Binary Code is Not Easy 8

.text

Symbol tableaddress of foo: 0x4000address of bar: 0x40404000 <foo>:…………jmp %eax

data

…………retxchg %eax %eax

4040 <bar>:…………call _abort

…………jmp bar

Code Discovery:• Code and data are mixed• Compiler may insert padding• Function symbols can be incomplete or

missing• Instructions may overlap

CFG Construction:• Indirect control flow• Non-returning functions• Exception handling

Binary file:

Page 9: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Difficulties of accurate parsing

Binary Code is Not Easy 9

.text

Symbol tableaddress of foo: 0x4000address of bar: 0x40404000 <foo>:…………jmp %eax

data

…………retxchg %eax %eax

4040 <bar>:…………call _abort

…………jmp bar

Code Discovery:• Code and data are mixed• Compiler may insert padding• Function symbols can be incomplete or

missing• Instructions may overlap

CFG Construction:• Indirect control flow• Non-returning functions• Exception handling

CFG Partitioning:• Binary functions are complex

• Functions may share code• Function body may not be

continuous• Functions do not always terminate in a

return; sometimes they terminate in a jump (tail call) or a call instruction (non-returning function).

Binary file:

Page 10: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

How well do other tools do?

Binary Code is Not Easy 10

Stage Challenge GNU Objdump IDA Pro

Code Discovery

Code or data 0/2 1/2

Missing symbols recover 0/1227false entry 0

recover 608/1227false entry 408

Overlapping instructions

0/1 0/1

CFG Construction

Indirect control flow 0/6 2/6

Non-returning functions

0/2 1/2

CFG Partitioning

Functions sharing code

0/1 0/1

Non-continuous functions

0/1 0/1

Tail calls 0/1 1/1

Page 11: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

There is interaction between the parsing stages

Binary Code is Not Easy 11

Code Discovery CFG Construction

CFG Partitioning

Page 12: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

The ParseAPI approach

Binary Code is Not Easy 12

Code DiscoveryCFG ConstructionCFG Partitioning

Control flow (recursive) traversalCFG function representation model

Entry point of

foo

Entry point of

bar

Intraprocedural edge

Interprocedural edge

Challenge specific analyses

Page 13: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

ParseAPI’s mechanisms and status

Binary Code is Not Easy 13

Stage Challenge Our technique ParseAPI Status

Code Discovery

Code or data Control flow traversal

Done

Missing symbols Probabilistic gap parsing

Integrating

Overlapping instructions

Control flow traversal

Done

CFG Construction

Indirect control flow

Backward slicing + Symbolic evaluation

Integrating

Non-returning functions

Might-ret analysis Done

Exception handling

Leveraging debugging information

Limited

CFG Partitioning

Complex functions

CFG function representation model

Done

Tail calls Stack analysis + CFG structure analysis

Prototyping

Page 14: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

14

Challenge: Code or Data?

Binary Code is Not Easy

However, if the jump table data bytes are misinterpreted as code:

Jump table data:

Entry Address Bytes Data value

eax=0 80b15f0 e0 4c 02 00 0x24ce0

eax=1 80b15f4 98 4a 02 00 0x24a98

eax=2 80b15f8 a6 4a 02 00 0x24aa6

eax=3 80b15fc b3 4a 02 00 0x24ab3

80b134e: call 8050760 <__i686.get_pc_thunk.bx>80b1353: add $0x25025,%ebx80b15e5: mov %ebx,%ecx80b15e7: sub -0x24d88(%ebx,%eax,4),%ecx # [80b15f0+eax*4]80b15ee: jmp *%ecx

80b15f0: e0 4c loopne 80b163e 80b15f2: 02 00 add (%eax),%al80b15f4: 98 cwtl 80b15f5: 4a dec %edx80b15f6: 02 00 add (%eax),%al80b15f8: a6 cmpsb %es:(%edi),%ds:(%esi)80b15f9: 4a dec %edx80b15fa: 02 00 add (%eax),%al80b15fc: b3 4a mov $0x4a,%bl80b15fe: 02 00 add (%eax),%al

Page 15: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

15

Challenge: Overlapping instructions

Binary Code is Not Easy

3fe9e8: je 3fe9eb3fe9ea: lock cmpxchg %ecx, 0x35b0(%ebx)3fe9f2: jne 3ff740

e8 e9 ea eb ec ed ee ef f0 f1 f2 ~ f7

74 01 f0 0f b1 8b b0 35 00 00 0f .. 00

je 3fe9eb lock cmpxchg %ecx, 0x35b0(%ebx) jne 3ff740 cmpxchg %ecx, 0x35b0(%ebx)

Address 3fe9

Bytes

In ParseAPI, overlapping instructions are in separate basic blocks that have the same predecessors and successors in the CFG.

Page 16: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

16

Challenge: Indirect control flow

1. Backward slice on jmpq:All instructions that calculate the jump targets

Binary Code is Not Easy

c8a42a: cmp $0xc,%dilc8a42e: ja c8a518 c8a434: lea 0x31fd41(%rip),%r8c8a43b: movzbl %dil,%edic8a43f: mov %rsi,%rbpc8a442: movslq (%r8,%rdi,4),%raxc8a446: add %rax,%r8c8a449: jmpq *%r8

Page 17: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

17

Challenge: Indirect control flow

1. Backward slice on jmpq

Binary Code is Not Easy

c8a42a: cmp $0xc,%dilc8a42e: ja c8a518 c8a434: lea 0x31fd41(%rip),%r8c8a43b: movzbl %dil,%edic8a43f: mov %rsi,%rbpc8a442: movslq (%r8,%rdi,4),%raxc8a446: add %rax,%r8c8a449: jmpq *%r8

2. Symbolically evaluate the jump target

Variable

Symbolic value

rax

rdi

r8

c8a42a and c8a42e: dil = RDI and RDI ≤ 12

c8a434: r8; = rip + 31fd41 ; = faa175

c8a43b: RDI ≤ 12

c8a442: rax = [r8 + rdi × 4]; = [faa175 + RDI × 4]

c8a446: ; r8 = r8 + rax ; = faa175 + [faa175 + RDI × 4]

RDI (RDI ≤ 12)

faa175

faa175 + [faa175 + RDI × 4]

[faa175 + RDI × 4]

<Offset base> + [<Table base> + <Table index> × <Table stride>]

Page 18: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

18

Challenge: Indirect control flow

Binary Code is Not Easy

Variation and complexity of indirect control flowo Jump target formulas can involve 0 or more

memory accesses

o Variation of formula elemento Table stride can vary; we have seen 2, 4 and 8o Table index can come from any register or a memory

locationo Offset base and table base may be computed by various

instructions

o Table index conditiono Comparison and conditional jump instruction pairo Bit operation “and” : and 0x3,%rax implies rax ≤ 3

<Table base> ± <Table index> × <Table stride>

[<Table base> ± <Table index> × <Table stride>]<Offset base> ± [<Table base> ± <Table index> × <Table stride>]

Page 19: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

19

Challenge: Non-returning functions

Binary Code is Not Easy

Looks like the call is inside a

loop

Page 20: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

20

Challenge: Non-returning functions

Binary Code is Not Easy

CFG of the function when not considering non-returning functions

Superficial call fall through edge

Function call edge

Intraprocedural edge in loop

Intraprocedural edge not in loopThe loop is

created by superficial call

fall through edges

Page 21: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

21

Challenge: Non-returning functions

Binary Code is Not Easy

CFG of the function

No loopIntraprocedural edge

Function call edge

Page 22: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

22

Might-ret analysis

o Return status of a function can be "unknown", "might ret", or “does not ret"

o Calculate a fix point for all functions

Binary Code is Not Easy

1. known funcsexit_exitabort__f90_stopfancy_abort__stack_chk_fail__assert_failExitProcess………

does not ret

2. reach ret

might ret

<foo>:push %ebpmov %esp,%ebpsub $0x18,%espmov (%eax),%eaxleaveret

3. call

unknown: first analyze barmight ret: continue in foodoes not ret: stop

<foo>:…………call <bar>…………

bar’s ret status

4. recursion

foo and bar do not ret

<foo>: unknown

<bar>: unknown

Page 23: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

23

Conclusions

o This work was driven by our experiences with real binaries

o We discussed the challenges in three parsing stages

o We analyzed code examples in details and discussed how we address these challenges

Binary Code is Not Easy

Page 24: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

24

Missing Symbols

o Stripped binarieso Pattern matchingo Probabilistic parsing

Binary Code is Not Easy

Page 25: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

25

Indirect control flow

Binary Code is Not Easy

445755: 0x10(%rbx),%rax445759: test %rax,%rax44575c: je 4457c344575e: pop %rbx44575f: pop %rbp445760: mov %r12,%rdi445763: pop %r12445765: pop %r13445767: pop %r14445769: jmpq *%rax

1. Backward slice2. Symbolic

evaluation3. Bound fact analysis4. Get jump targets

Symbolic expression: jmpq [rbx+0x10]

Bound facts:

Calculate and check jump targets:cannot find any candidate jump targets!

Page 26: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

Complex functions

26Binary Code is Not Easy

38958db680 <__write>:b680: cmpl $0x0,0x2b6fe9(%rip)b687: jne 38958db699

38958db689 <__write_nocancel>:b689: mov $0x1,%eax…………………b696: jae 38958db6c9 b698: retq b699: sub $0x8,%rsp…………………b6c6: jae 38958db6c9 b6c8: retq b6c9: mov 0x2b18d0(%rip),%rcx…………………b6dc: jmp 38958db6c8

Jump into

another function

!

Block1: [b680, b689)

Block2: [b689, b698)

Block3: [b698, b699)

Block4: [b699, b6c8)

Block5: [b6c8, b6c9)

Block6: [b6c9, b6df)

MECFG Model:

__write: block 1,2,3,4,5,6__write_nocancel: block 2,3

Entry of __write

Entry of __write_nocanc

el

Page 27: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

27

Challenge: Indirect control flow

1. Backward slice on jmpq

2. Symbolic evaluation

3. Bound fact analysis4. Get jump targets

Binary Code is Not Easy

c8a42a: cmp $0xc,%dilc8a42e: ja c8a518 c8a434: lea 0x31fd41(%rip),%r8c8a43b: movzbl %dil,%edic8a43f: mov %rsi,%rbpc8a442: movslq (%r8,%rdi,4),%raxc8a446: add %rax,%r8c8a449: jmpq *%r8

Jump target expression and condition: jmpq [rdi*4+0xfaa17c]+0xfaa17c

rdi <= 0xc

rdi=0 faa17c: d4 02 ce ff # -0x31fd2cfaa180: 27 ff ff ff # -0x31fcecfaa184: 4d ff ff ff # -0x31fcbc

rdi=1

rdi=2

jmpq c8a450

jmpq c8a490

jmpq c8a4c0

Page 28: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

28

o Identifying tail calls is difficult because known function entry points may be incomplete

Tail calls

Binary Code is Not Easy

42b678: pop %rbp42b679: pop %r1242b67b: pop %r1342b67d: pop %r1442b67f: jmpq 42a5f0

Is 42a5f0 a function entry

or not?

Page 29: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

29

Tail calls

Binary Code is Not Easy

f92a0 <_ZThn16_N12BPatch_imageD0Ev>:f92a0: add $0xfffffffffffffff0,%rdi f92a4: jmp <_ZN12BPatch_imageD0Ev>f92a6: nopw %cs:0x0(%rax,%rax,1)f92ad: 00 00 00

f92b0 <_ZN12BPatch_imageD0Ev>:f92b0: push %rbxf92b1: mov %rdi,%rbx……………

Code added by the compiler to implement virtual functions. We do not have evidence to decide whether :(a)A function tail calls the other one.(b)Two functions share code

Page 30: Paradyn Project Petascale Tools Workshop Madison, Wisconsin Aug 4-Aug 7, 2014 Binary Code is Not Easy Xiaozhu Meng, Emily Gember-Jacobson, and Bill Williams

30

Tail call analysis

o Indication of tail callso Control flow structures

Binary Code is Not Easy