
Ras Bodik, CS 164, Lecture 24

Dynamic Binary Translation

Lecture 24

acknowledgement: E. Duesterwald (IBM), S. Amarasinghe (MIT)


Lecture Outline

• Binary Translation: Why, What, and When.
• Why: Guarding against buffer overruns
• What, when: overview of two dynamic translators:
  – Dynamo-RIO by HP, MIT
  – CodeMorph by Transmeta
• Techniques used in dynamic translators
  – Path profiling


Motivation: preventing buffer overruns

Recall the typical buffer overrun attack:

1. program calls a method foo()

2. foo() copies a string into an on-stack array:
   – string supplied by the user
   – user’s malicious code copied into foo’s array
   – foo’s return address overwritten to point to user code

3. foo() returns – unknowingly jumping to the user code


Preventing buffer overrun attacks

Two general approaches:

• static (compile-time): analyze the program
  – find all array writes that may fall outside array bounds
  – program proven safe before you run it
• dynamic (run-time): analyze the execution
  – make sure no write outside an array happens
  – execution proven safe (enough to achieve security)


Dynamic buffer overrun prevention

the idea, again:

• prevent writes outside the intended array
  – as is done in Java
  – harder in C: must add “size” to each array
    • done in CCured, a Berkeley project


A different idea

perhaps less safe, but easier to implement:
– goal: detect that the return address was overwritten.

instrument the program so that it keeps an extra copy of the return address:

1. when a function is called, store the return address aside (in an inaccessible shadow stack)
2. when returning, check that the return address in the AR (activation record) matches the stored one
3. if they mismatch, terminate the program
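The three steps above can be sketched as a small simulation. This is only an illustration, not how any particular product implements the check: the `Machine` class, the `run_demo` helper, and the addresses are made up for the example, and a real instrumenter would insert x86 instructions rather than Python calls.

```python
class ReturnAddressMismatch(Exception):
    """Raised when the on-stack return address differs from the shadow copy."""


class Machine:
    def __init__(self):
        self.stack = []         # program-visible stack (a buffer overrun can clobber it)
        self.shadow_stack = []  # protected copy of return addresses

    def call(self, return_address):
        # Step 1: keep the return address on the stack and store a copy aside.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # Step 2: compare the address in the activation record with the copy.
        actual = self.stack.pop()
        expected = self.shadow_stack.pop()
        if actual != expected:
            # Step 3: mismatch, so terminate instead of jumping.
            raise ReturnAddressMismatch(
                f"expected {expected:#x}, found {actual:#x}")
        return actual


def run_demo(overflow):
    """Call foo() once; optionally simulate an overrun of foo's frame."""
    machine = Machine()
    machine.call(0x401000)               # hypothetical return address
    if overflow:
        machine.stack[-1] = 0xBADC0DE    # attacker overwrites the return address
    try:
        return hex(machine.ret())
    except ReturnAddressMismatch:
        return "terminated"
```

Calling `run_demo(False)` returns normally, while `run_demo(True)` is stopped at the return. The check detects the overwrite only at return time; it does not prevent the out-of-bounds write itself.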


Commercially interesting

• Similar idea behind the product by determina.com

• key problem:
  – reducing overhead of instrumentation
• what’s instrumentation, anyway?
  – adding statements to an existing program
  – in our case, to x86 executables

• Determina uses binary translation


What is Binary Translation?

• Translating a program in one binary format to another, for example:
  – MIPS → x86 (to port programs across platforms)
• We can view “binary format” liberally:
  – Java bytecode → x86 (to avoid interpretation)
  – x86 → x86 (to optimize the executable)


When does the translation happen?

• Static (off-line): before the program is run
  – Pros: no serious translation-time constraints
• Dynamic (on-line): while the program is running
  – Pros:
    • access to complete program (program is fully linked)
    • access to program state (including values of data structures)
    • can adapt to changes in program behavior
• Note: Pros(dynamic) = Cons(static)


Why? Translation Allows Program Modification

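[Figure: the tool chain Program → Compiler → Linker → Loader → Runtime System, laid out along an axis from Static to Dynamic, with the program modifiers that attach at each stage]

• Early (static) stages:
  – Instrumenters
• Linker and loader:
  – Load-time optimizers
  – Shared library mechanism
• Runtime system (dynamic):
  – Debuggers
  – Interpreters
  – Just-In-Time Compilers
  – Dynamic Optimizers
  – Profilers
  – Dynamic Checkers
  – Instrumenters
  – Etc.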


Applications, in more detail

• profilers:
  – add instrumentation instructions to count basic block execution counts (e.g., gprof)
• load-time optimizers:
  – remove caller/callee save instructions (callers/callees known after DLLs are linked)
  – replace long jumps with short jumps (code position known after linking)
• dynamic checkers:
  – finding memory access bugs (e.g., Rational Purify)


Dynamic Program Modifiers

[Figure: three layers, top to bottom]

• Running Program
• Dynamic Program Modifier: observe/manipulate every instruction in the running program
• Hardware Platform


In more detail

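[Figure: three software stacks, compared side by side]

• common setup: the application and DLLs run on the OS, which runs on the CPU
• CodeMorph (Transmeta): the application, DLLs, and OS all run on top of CodeMorph, which runs on the CPU (CPU = VLIW)
• Dynamo-RIO (HP, MIT): the application and DLLs run on top of Dynamo, which runs alongside the OS (CPU = x86)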


Dynamic Program Modifiers

Requirements:
• Ability to intercept execution at arbitrary points
• Observe executing instructions
• Modify executing instructions
• Transparency
  – modified program is not specially prepared
• Efficiency
  – amortize overhead and achieve near-native performance
• Robustness
• Maintain full control and capture all code
  – sampling is not an option (there are security applications)


HP Dynamo-RIO

• Building a dynamic program modifier
  – Trick I: adding a code cache
  – Trick II: linking
  – Trick III: efficient indirect branch handling
  – Trick IV: picking traces
• Dynamo-RIO performance
• Run-time trace optimizations


System I: Basic Interpreter

[Figure: the Instruction Interpreter loop: next VPC → fetch next instruction → decode → execute → update VPC, with a separate path for exception handling]

How it does against the requirements:
• Intercept execution: yes
• Observe & modify executing instructions: yes
• Transparency: yes
• Efficiency? No: up to several 100x slowdown
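The interpreter loop can be sketched on a toy instruction set. Everything here is an assumption made for illustration: the tuple encoding, the four opcodes (`add`, `addi`, `jle`, `halt`), and the `PROGRAM` itself are invented and are not x86. VPC is the virtual program counter, as on the slide.

```python
# Toy program: sum 0..4 into ecx (hypothetical instruction encoding).
PROGRAM = [
    ("add", "eax", "ecx"),   # 0: ecx += eax
    ("addi", 1, "eax"),      # 1: eax += 1
    ("jle", "eax", 4, 0),    # 2: if eax <= 4 goto 0
    ("halt",),               # 3: stop
]


def interpret(program, regs):
    """Interpret a toy program; return how many instructions were executed."""
    vpc = 0        # virtual program counter
    executed = 0
    while True:
        op, *args = program[vpc]   # fetch and decode
        executed += 1              # every instruction passes through this loop
        if op == "add":            # add src, dst: dst += src
            src, dst = args
            regs[dst] += regs[src]
            vpc += 1               # update VPC
        elif op == "addi":         # addi imm, dst: dst += imm
            imm, dst = args
            regs[dst] += imm
            vpc += 1
        elif op == "jle":          # jle reg, imm, target
            reg, imm, target = args
            vpc = target if regs[reg] <= imm else vpc + 1
        elif op == "halt":
            return executed
        else:
            raise ValueError(f"unknown opcode {op!r}")
```

Because every instruction goes through the fetch/decode/dispatch path, the interpreter can observe or rewrite any of them, which is also why it is slow: the dispatch work is repeated on every execution of every instruction.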


Trick I: Adding a Code Cache

[Figure: the runtime loop: next VPC → lookup VPC; on a miss, fetch block at VPC → emit block; then context switch → execute block in the BASIC BLOCK CACHE, which holds the non-control-flow instructions; a separate path handles exceptions]


Example Basic Block Fragment

Original code:

    add %eax, %ecx
    cmp $4, %eax
    jle $0x40106f

Fragment in the code cache:

    frag7:
        add %eax, %ecx
        cmp $4, %eax
        jle <stub1>
        jmp <stub2>
    stub1:
        mov %eax, eax-slot    # spill eax
        mov &dstub1, %eax     # store ptr to stub table
        jmp context_switch
    stub2:
        mov %eax, eax-slot    # spill eax
        mov &dstub2, %eax     # store ptr to stub table
        jmp context_switch


Runtime System with Code Cache

[Figure: next VPC → basic block builder ↔ context switch ↔ BASIC BLOCK CACHE]

Improves performance:
• slowdown reduced from 100x to 17-26x
• remaining bottleneck: frequent (costly) context switches
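A rough sketch of the dispatcher with a basic block cache, on a toy instruction set. The tuple encoding, `PROGRAM`, and the function names are invented for the example, and the "cached block" here is still run by Python code; in the real system the block would be emitted as native instructions, so only the control structure carries over.

```python
# Toy program: sum 0..4 into ecx (hypothetical instruction encoding).
PROGRAM = [
    ("add", "eax", "ecx"),   # 0: ecx += eax
    ("addi", 1, "eax"),      # 1: eax += 1
    ("jle", "eax", 4, 0),    # 2: if eax <= 4 goto 0
    ("halt",),               # 3: stop
]


def build_block(program, start):
    """Copy instructions from start through the next control-flow instruction."""
    block, pc = [], start
    while True:
        instr = program[pc]
        block.append(instr)
        pc += 1
        if instr[0] in ("jle", "halt"):
            return block, pc          # pc is the fall-through address


def execute_block(block, fallthrough, regs):
    """Run one cached block; return the next VPC, or None on halt."""
    for op, *args in block:
        if op == "add":
            src, dst = args
            regs[dst] += regs[src]
        elif op == "addi":
            imm, dst = args
            regs[dst] += imm
        elif op == "jle":
            reg, imm, target = args
            return target if regs[reg] <= imm else fallthrough
        elif op == "halt":
            return None


def run(program, regs):
    """Dispatch loop: look up VPC, build the block on a miss, execute it."""
    cache = {}                                    # VPC -> (block, fallthrough)
    stats = {"blocks_built": 0, "context_switches": 0}
    vpc = 0
    while vpc is not None:
        stats["context_switches"] += 1            # every block exit comes back here
        if vpc not in cache:
            cache[vpc] = build_block(program, vpc)
            stats["blocks_built"] += 1
        block, fallthrough = cache[vpc]
        vpc = execute_block(block, fallthrough, regs)
    return stats
```

In this run each block is built once and reused, but control still returns to the dispatcher after every block, which is the remaining bottleneck named above.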


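Linking a Basic Block Fragment

Original code:

    add %eax, %ecx
    cmp $4, %eax
    jle $0x40106f

Fragment in the code cache, with both exits linked directly to their target fragments (the stubs remain but are bypassed):

    frag7:
        add %eax, %ecx
        cmp $4, %eax
        jle <frag42>
        jmp <frag8>
    stub1:
        mov %eax, eax-slot
        mov &dstub1, %eax
        jmp context_switch
    stub2:
        mov %eax, eax-slot
        mov &dstub2, %eax
        jmp context_switch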


Trick II: Linking

[Figure: the runtime loop with linking: next VPC → lookup VPC; on a miss, fetch block at VPC → emit block → link block; then context switch → execute in the BASIC BLOCK CACHE until a cache miss; a separate path handles exceptions]
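Direct branch linking can be sketched on a toy instruction set. The encoding, `PROGRAM`, and the `Fragment` class are invented for illustration; patching an exit is modeled as filling in a dictionary entry, whereas the real system rewrites the branch instruction inside the code cache.

```python
# Toy program: sum 0..4 into ecx (hypothetical instruction encoding).
PROGRAM = [
    ("add", "eax", "ecx"),   # 0: ecx += eax
    ("addi", 1, "eax"),      # 1: eax += 1
    ("jle", "eax", 4, 0),    # 2: if eax <= 4 goto 0
    ("halt",),               # 3: stop
]


class Fragment:
    def __init__(self, block, fallthrough):
        self.block = block
        self.fallthrough = fallthrough
        self.links = {}      # target VPC -> Fragment, once that exit is linked


def build_fragment(program, start):
    """Copy instructions from start through the next control-flow instruction."""
    block, pc = [], start
    while True:
        instr = program[pc]
        block.append(instr)
        pc += 1
        if instr[0] in ("jle", "halt"):
            return Fragment(block, pc)


def execute(frag, regs):
    """Run one fragment; return the next VPC, or None on halt."""
    for op, *args in frag.block:
        if op == "add":
            regs[args[1]] += regs[args[0]]
        elif op == "addi":
            regs[args[1]] += args[0]
        elif op == "jle":
            reg, imm, target = args
            return target if regs[reg] <= imm else frag.fallthrough
        elif op == "halt":
            return None


def run(program, regs):
    cache = {}
    stats = {"context_switches": 0, "links": 0}
    vpc, prev = 0, None
    while vpc is not None:
        stats["context_switches"] += 1            # back in the runtime system
        if vpc not in cache:
            cache[vpc] = build_fragment(program, vpc)
        frag = cache[vpc]
        if prev is not None and vpc not in prev.links:
            prev.links[vpc] = frag                # patch the exit that brought us here
            stats["links"] += 1
        while True:                               # execute until an unlinked exit
            nxt = execute(frag, regs)
            if nxt is None or nxt not in frag.links:
                break
            frag = frag.links[nxt]                # linked: stay inside the cache
        prev, vpc = frag, nxt
    return stats
```

On this toy loop the number of returns to the dispatcher drops from one per block execution to three in total, because the loop's backward branch is linked to its own fragment after the first iteration.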


Performance Effect of Basic Block Cache with Direct Branch Linking

[Figure: bar chart for vpr (Spec2000) on two data sets; y-axis: slowdown over native execution. Bars for the block cache alone: 26.03 and 17.45. Bars for the block cache with direct linking: 2.97 and 3.63.]

Performance problem: mispredicted indirect branches


Indirect Branch Handling

Conditionally “inline” a preferred indirect branch target as the continuation of the trace.

Original code:

    ret

Code in the trace, followed by <preferred target>:

    mov %edx, edx_slot       # save app’s edx
    pop %edx                 # load actual target
    <save flags>
    cmp %edx, $0x77f44708    # compare to preferred target
    jne <exit stub>
    mov edx_slot, %edx       # restore app’s edx
    <restore flags>
    <inlined preferred target>

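Indirect Branch Linking

Code emitted for an indirect branch:

    <load actual target>
    <compare to inlined target>
    if equal goto <inlined target>
    lookup IBT table
    if (! tag-match) goto <exit stub>
    jump to tag-value

    <inlined target>
    <exit stub>

[Figure: fragments H, I, J, K, L in the code cache and the Shared Indirect Branch Target (IBT) Table; table entries are tagged with original targets (F and H in the figure) and point to the linked targets in the cache]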
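The lookup order above (inlined target first, then the shared IBT table, then the exit stub) can be sketched as follows. This is only a model of the decision logic: the table is a Python dictionary, the fragment names and the addresses other than the slide's 0x77f44708 are hypothetical, and nothing here reflects the actual table layout or hash function.

```python
from collections import Counter

PREFERRED_TARGET = 0x77F44708          # the preferred target from the example above


def resolve_indirect(actual_target, ibt_table, inlined_target=PREFERRED_TARGET):
    """Decide where an indirect branch goes without leaving the code cache."""
    if actual_target == inlined_target:
        return ("inlined", inlined_target)     # fall through into the inlined code
    fragment = ibt_table.get(actual_target)    # tag is the original target address
    if fragment is not None:
        return ("ibt_hit", fragment)           # jump to the linked fragment
    return ("exit_stub", actual_target)        # miss: context switch to the runtime


def count_outcomes(targets, ibt_table):
    """Tally how each dynamic branch target in a sequence is resolved."""
    return Counter(resolve_indirect(t, ibt_table)[0] for t in targets)


# Hypothetical table: original target address -> fragment in the code cache.
IBT_TABLE = {0x401200: "frag_H", 0x401300: "frag_K"}
```

For example, `count_outcomes([0x77F44708, 0x77F44708, 0x401200, 0x999999], IBT_TABLE)` resolves two branches through the inlined check, one through the table, and sends one to the exit stub. Only that last case pays for a context switch.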


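Trick III: Efficient Indirect Branch Handling

[Figure: next VPC → basic block builder ↔ context switch ↔ BASIC BLOCK CACHE (non-control-flow instructions); indirect branches in the cache go through an indirect branch lookup, and control returns to the runtime system only on a miss]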


Performance Effect of Indirect Branch Linking

[Figure: bar chart for vpr (Spec2000) on two data sets; y-axis: slowdown over native execution. Bars for the block cache alone: 26.03 and 17.45. Bars for the block cache with direct linking: 2.97 and 3.63. Bars for the block cache with linking (direct + indirect): 1.20 and 1.15.]

Performance problem: poor code layout in code cache


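Trick IV: Picking Traces

Block Cache has poor execution efficiency:
• Increased branching, poor locality

Pick traces to:
• reduce branching & improve layout and locality
• New optimization opportunities across block boundaries

[Figure: on the left, the Block Cache holds basic blocks A through L as separate fragments of a control-flow graph; on the right, the Trace Cache lays hot paths out contiguously, shown as the block sequence A B E F H D G K J]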


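Picking Traces

[Figure: the runtime system with two caches: START → dispatch, which drives the basic block builder and the trace selector; a context switch separates the runtime system from the BASIC BLOCK CACHE and the TRACE CACHE]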


Picking hot traces

• The goal: path profiling
  – find frequently executed control-flow paths
  – connect basic blocks along these paths into contiguous sequences, called traces.
• The problem: find a good trade-off between
  – profiling overhead (counting execution events), and
  – accuracy of the profile.


Alternative 1: Edge profiling

The algorithm:
• Edge profiling: measure frequencies of all control-flow edges, then after a while
• Trace selection: select hot traces by following the highest-frequency branch outcome.

Disadvantages:
• Inaccurate: may select infeasible paths (due to branch correlation)
• Overhead: must profile all control-flow edges
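A small sketch of how edge profiling can select a path that never executes. The control-flow graph, the block names, and the path counts are invented to make the effect visible: three correlated branches, where each branch outcome on the selected trace is individually the most frequent one, yet the three never occur together.

```python
from collections import Counter

# Hypothetical profile: executed paths (one letter per basic block) and counts.
# Branches at A, D, and G are correlated: no run takes B, E, and H together.
EXECUTED_PATHS = {
    "ABDEGIJ": 34,
    "ABDFGHJ": 33,
    "ACDEGHJ": 33,
}


def edge_profile(paths):
    """Accumulate a frequency for every control-flow edge."""
    edges = Counter()
    for path, count in paths.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges


def select_trace(edges, start):
    """Follow the highest-frequency outgoing edge until a block has none."""
    trace = start
    while True:
        successors = {dst: n for (src, dst), n in edges.items()
                      if src == trace[-1]}
        if not successors:
            return trace
        trace += max(successors, key=successors.get)
```

Here `select_trace(edge_profile(EXECUTED_PATHS), "A")` yields the path A B D E G H J, which does not appear in `EXECUTED_PATHS` at all. The sketch assumes an acyclic graph; with loops, trace selection would also need a stopping rule.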


Alternative 2: Bit-tracing path profiling

The algorithm:
– collect path signatures and their frequencies
– path signature = <start addr>.history
– example: <label7>.0101101
– must include addresses of indirect branches

Advantages:
– accuracy

Disadvantages:
– overhead: need to monitor every branch
– overhead: counter storage (one counter per path!)
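A minimal sketch of building and counting path signatures, assuming each history bit records one conditional-branch outcome (1 for taken, 0 for not taken). That bit convention and the function names are assumptions, and the sketch leaves out the indirect-branch addresses that a full signature must include.

```python
from collections import Counter


def path_signature(start_label, outcomes):
    """Build '<start>.<history>' with one bit per conditional-branch outcome."""
    history = "".join("1" if taken else "0" for taken in outcomes)
    return f"<{start_label}>.{history}"


def profile(executions):
    """Count how often each path signature occurs (one counter per path)."""
    counts = Counter()
    for start_label, outcomes in executions:
        counts[path_signature(start_label, outcomes)] += 1
    return counts
```

For instance, a path starting at `label7` with outcomes not-taken, taken, not-taken, taken, taken, not-taken, taken gets the signature `<label7>.0101101`. The `Counter` grows by one entry for every distinct path, which is the storage overhead listed above.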


Alternative 3: Next Executing Tail (NET)

This is the algorithm of Dynamo:
– profiling: count only frequencies of start-of-trace points (which are targets of original backedges)
– trace selection: when a start-of-trace point becomes sufficiently hot, select the sequence of basic blocks executed next.
– may select a rare (cold) path, but statistically selects a hot path!
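NET can be sketched over a recorded stream of basic-block addresses. Several details are assumptions made to keep the example small: a backward branch is modeled as a transition to an address that is not higher than the current one, a trace is taken to end at the next backward branch, and the threshold values are arbitrary. Dynamo's actual end-of-trace conditions and threshold are not given here.

```python
def net_select(execution, threshold):
    """Select traces from a stream of executed basic-block addresses."""
    counters = {}      # one counter per target of a backward branch
    traces = {}        # trace head -> tuple of blocks in the trace
    recording = None   # (head, blocks) while the next executing tail is collected
    prev = None
    for block in execution:
        backward = prev is not None and block <= prev
        if recording is not None:
            if backward:
                head, blocks = recording       # tail ends at a backward branch
                traces[head] = tuple(blocks)
                recording = None
            else:
                recording[1].append(block)
        if recording is None and backward and block not in traces:
            counters[block] = counters.get(block, 0) + 1
            if counters[block] >= threshold:
                recording = (block, [block])   # hot: record whatever runs next
        prev = block
    return traces, counters


# Hypothetical loop with head 10: the common body is 10, 20, 40 and the
# rare body is 10, 30, 40 (taken once, on the third iteration).
EXECUTION = [10, 20, 40, 10, 20, 40, 10, 30, 40, 10, 20, 40, 10, 20, 40]
```

With `threshold=3` the head becomes hot just before a common iteration and the trace is 10, 20, 40. With `threshold=2` it becomes hot just before the rare iteration and the trace is 10, 30, 40, which is the cold-path risk mentioned above. In both cases only one counter is needed for the whole loop.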


NET (continued)

[Figure: control-flow graph with basic blocks A through L]

Advantages of NET:
• very light-weight
  – #instrumentation points = #targets of backward branches
  – #counters = #targets of backward branches
• statistically likely to pick the hottest path
• picks only feasible paths
• easy to implement


Spec2000 Performance on Windows (w/o trace optimizations)

[Figure: bar chart; y-axis: slowdown vs. native execution, from 0.0 to 2.2; x-axis: art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, parser, perlbmk, twolf, vortex, vpr, and the harmonic mean (H_MEAN)]

Spec2000 Performance on Linux (w/o trace optimizations)

[Figure: bar chart; y-axis: slowdown vs. native execution, from 0.0 to 1.7; x-axis: ammp, applu, apsi, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, perlbmk, sixtrack, swim, twolf, vortex, vpr, wupwise, and the harmonic mean (H_MEAN)]


Performance on Desktop Applications

[Figure: bar chart; y-axis: slowdown vs. native execution, from 0.0 to 1.6; x-axis: Adobe Acrobat, Microsoft Excel, Microsoft PowerPoint, Microsoft Word]


Performance Breakdown

[Figure: pie chart of where execution time is spent]

• code cache: 86%
• (slice label missing): 11%
• trace branch taken: 2%
• rest of system: 1%


Trace optimizations

• Now that we have built the traces, let’s optimize them
• But what’s left to optimize in statically optimized code?
• Limitations of static compiler optimization:
  – cost of call-specific interprocedural optimization
  – cost of path-specific optimization in presence of complex control flow
  – difficulty of predicting indirect branch targets
  – lack of access to shared libraries
  – sub-optimal register allocation decisions
  – register allocation for individual array elements or pointers


Maintaining Control (in the real world)

• Capture all code: execution only takes place out of the code cache
• Challenging for abnormal control flow
• System must intercept all abnormal control flow events:
  – Exceptions
  – Callbacks in Windows
  – Asynchronous procedure calls
  – Setjmp/longjmp
  – Set thread context
