
Building Dynamic Instrumentation Tools With

DynamoRIO

Derek Bruening

Timothy Garnett

Copyright © 2008 VMware, Inc. All rights reserved.

Tutorial Outline

1:00-1:10 Welcome + DynamoRIO History

1:10-2:00 DynamoRIO Internals

2:00-2:30 Examples, Part 1

2:30-2:45 Break

2:45-3:15 DynamoRIO API

3:15-3:45 Examples, Part 2

3:45-4:00 Feedback


DynamoRIO History

Dynamo

HP Labs: PA-RISC late 1990’s

x86 Dynamo: 2000

RIO → DynamoRIO

MIT: 2001-2004

Prior releases

0.9.1: Jun 2002 (PLDI tutorial)

0.9.2: Oct 2002 (ASPLOS tutorial)

0.9.3: Mar 2003 (CGO tutorial)

0.9.4: Feb 2005


DynamoRIO History

0.9.5: Apr 2008 (CGO tutorial)

0.9.6: Sep 2008 (GoVirtual.org launch)

Determina

2003-2007

Security company

VMware

Acquired Determina (and DynamoRIO) in 2007


Tutorial Outline

1:00-1:10 Welcome + DynamoRIO History

1:10-2:00 DynamoRIO Internals

2:00-2:30 Examples, Part 1

2:30-2:45 Break

2:45-3:15 DynamoRIO API

3:15-3:45 Examples, Part 2

3:45-4:00 Feedback


DynamoRIO Internals


Typical Modern Application: IIS


Runtime Interposition Layer


Design Goals

Efficient

Near-native performance

Transparent

Match native behavior

Comprehensive

Control every instruction, in any application

Customizable

Adapt to satisfy disparate tool needs


Challenges of Real-World Apps

Multiple threads

Synchronization

Application introspection

Reading of return address

Transparency corner cases are the norm

Example: access beyond top of stack

Scalability

Must adapt to varying code sizes, thread counts, etc.

Dynamically generated code

Performance challenges


Internals Outline

Efficient

Software code cache overview

Thread-shared code cache

Cache capacity limits

Data structures

Transparent

Comprehensive

Customizable


Direct Code Modification

Kernel32!TerminateProcess:

7d4d1028 7c 05 jl 7d4d102f

7d4d102a 33 c0 xor %eax,%eax

7d4d102c 40 inc %eax

7d4d102d eb 08 jmp 7d4d1037

7d4d102f 50 push %eax

7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

e9 37 6f 48 92 jmp <callout>


Variable-Length Instruction Complications

Kernel32!TerminateProcess:

7d4d1028 7c 05 jl 7d4d102f

7d4d102a 33 c0 xor %eax,%eax

7d4d102c 40 inc %eax

7d4d102d eb 08 jmp 7d4d1037

7d4d102f 50 push %eax

7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

e9 37 6f 48 92 jmp <callout>


Entry Point Complications

Kernel32!TerminateProcess:

7d4d1028 7c 05 jl 7d4d102f

7d4d102a 33 c0 xor %eax,%eax

7d4d102c 40 inc %eax

7d4d102d eb 08 jmp 7d4d1037

7d4d102f 50 push %eax

7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

e9 37 6f 48 92 jmp <callout>


Direct Code Modification: Too Limited

Not transparent

Cannot write jump atomically if crosses cache line

Even if write is atomic, not safe if overwrites part of next instruction

Jump may span code entry point

Too limited

Not safe w/o suspending all threads and knowing all entry points

Limited to inserting callouts

Code displaced by jump is a mini code cache

All the same consistency challenges of larger cache

Inter-operation issues with other hooks
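The two safety concerns on this slide can be checked mechanically. The following is a minimal sketch, not DynamoRIO code: it assumes a 64-byte cache line (the slides do not state a line size), uses the 5-byte `e9` jump and the instruction addresses from the TerminateProcess listing, and the helper names `crosses_cache_line` and `splits_instruction` are hypothetical.

```python
CACHE_LINE_SIZE = 64  # assumption: bytes per cache line; varies by processor
JMP_REL32_LEN = 5     # e9 opcode plus a 32-bit displacement

# (address, size in bytes) for each instruction in the TerminateProcess listing
LISTING = [
    (0x7d4d1028, 2),  # jl
    (0x7d4d102a, 2),  # xor
    (0x7d4d102c, 1),  # inc
    (0x7d4d102d, 2),  # jmp
    (0x7d4d102f, 1),  # push
    (0x7d4d1030, 5),  # call
]

def crosses_cache_line(addr, length=JMP_REL32_LEN, line_size=CACHE_LINE_SIZE):
    """True if the bytes [addr, addr + length) span two cache lines."""
    return addr // line_size != (addr + length - 1) // line_size

def splits_instruction(addr, instrs, length=JMP_REL32_LEN):
    """True if a patch of `length` bytes at `addr` ends inside an instruction."""
    patch_end = addr + length
    return any(start < patch_end < start + size for start, size in instrs)
```

Under these assumptions, a jump written at 0x7d4d1028 replaces exactly three whole instructions, while one written at 0x7d4d102c ends one byte into the `call` and leaves a partial instruction behind.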


We Need Indirection

Avoid transparency and granularity limitations of directly modifying application code

Allow arbitrary modifications at unrestricted points in code stream

Allow systematic, fine-grained modifications to code stream

Guarantee that all code is observed


Basic Interpreter

[Figure: START enters a loop of fetch, decode, execute]

Slowdown: ~300x


Software Code Caches

Performance boost for runtime systems

Virtual machines, interpreters, dynamic translators

Dynamic compilers (JIT, etc.) and optimizers

Simulators and emulators

Indirection of runtime code manipulation

Avoid limitations of directly modifying application code

Dynamic tools: profiling, security, optimization, auditing, introspection, analysis, ...


Improvement #1: Interpreter + Basic Block Cache

[Figure: START, basic block builder, dispatch, and context switch, with the BASIC BLOCK CACHE holding non-control-flow instructions]

Non-control-flow instructions executed from software code cache

Slowdown: 300x → 25x
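The dispatch loop behind this improvement can be modeled on a toy instruction set. This is a minimal sketch under stated assumptions: the four-instruction program, the opcodes, and all function names are invented for illustration, and the cached blocks are still interpreted by Python rather than executed natively, so it shows only the control structure (build on miss, reuse on hit), not the speedup.

```python
# Toy program: address -> (opcode, operand). Control-flow opcodes end a block.
PROGRAM = {
    0: ("set", 3),     # counter = 3
    1: ("dec", None),  # counter -= 1 (loop head)
    2: ("jnz", 1),     # if counter != 0 goto 1
    3: ("halt", None),
}
CONTROL_FLOW = {"jnz", "halt"}

def build_basic_block(start):
    """Copy instructions from `start` up to and including the first branch."""
    block, pc = [], start
    while True:
        op, arg = PROGRAM[pc]
        block.append((pc, op, arg))
        pc += 1
        if op in CONTROL_FLOW:
            return block

def execute_block(block, state):
    """Run one cached block; return the next application address or None."""
    for pc, op, arg in block:
        if op == "set":
            state["counter"] = arg
        elif op == "dec":
            state["counter"] -= 1
        elif op == "jnz":
            return arg if state["counter"] != 0 else pc + 1
        elif op == "halt":
            return None

def run(entry=0):
    """Dispatch loop: look up the target block, build it on a miss, run it."""
    cache, stats = {}, {"built": 0, "executed": 0}
    state, target = {"counter": 0}, entry
    while target is not None:
        if target not in cache:  # miss: invoke the basic block builder
            cache[target] = build_basic_block(target)
            stats["built"] += 1
        target = execute_block(cache[target], state)
        stats["executed"] += 1
    return stats
```

Every block exit in this model returns to the dispatch loop, which is the cost that the later linking improvements remove.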


Example Basic Block Fragment

Original application code:

  add %eax, %ecx
  cmp $4, %eax
  jle $0x40106f

Code cache fragment:

frag7:
  add %eax, %ecx
  cmp $4, %eax
  jle <stub0>
  jmp <stub1>
stub0:
  mov %eax, eax-slot
  mov &dstub0, %eax
  jmp context_switch
stub1:
  mov %eax, eax-slot
  mov &dstub1, %eax
  jmp context_switch

dstub0 target: 0x40106f
dstub1 target: fall-thru


Improvement #2: Linking Direct Branches

[Figure: START, basic block builder, dispatch, and context switch, with the BASIC BLOCK CACHE holding non-control-flow instructions]

Direct branch to existing block can bypass dispatch

Slowdown: 300x → 25x → 3x


Direct Linking

Original application code:

  add %eax, %ecx
  cmp $4, %eax
  jle $0x40106f

Code cache fragment, with both exits linked:

frag7:
  add %eax, %ecx
  cmp $4, %eax
  jle <frag42>
  jmp <frag8>
stub0:
  mov %eax, eax-slot
  mov &dstub0, %eax
  jmp context_switch
stub1:
  mov %eax, eax-slot
  mov &dstub1, %eax
  jmp context_switch

dstub0 target: 0x40106f
dstub1 target: fall-thru


Improvement #3: Linking Indirect Branches

[Figure: START, basic block builder, dispatch, context switch, and indirect branch lookup, with the BASIC BLOCK CACHE holding non-control-flow instructions]

Application address mapped to code cache

Slowdown: 300x → 25x → 3x → 1.2x


Indirect Branch Transformation

Original application code:

  ret

Code cache fragment:

frag8:
  mov %ecx, ecx-slot
  pop %ecx
  jmp <stub0>
stub0:
  mov %ebx, ebx-slot
  mov &dstub0, %ebx
  jmp ib_lookup


Improvement #4: Trace Building

Traces reduce branching, improve layout and locality, and facilitate optimizations across blocks

We avoid indirect branch lookup

Next Executing Tail (NET) trace building scheme [Duesterwald 2000]

[Figure: basic block cache containing blocks A through L, and trace cache containing blocks A, B, E, F, H, D, G, K, J]
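A simplified model of NET-style trace selection is sketched below. It is an illustration under assumptions, not the DynamoRIO implementation: blocks are identified by application address, a branch counts as backward when the target address is not higher than the source, the threshold of 3 is arbitrary, and the function name `select_traces` is hypothetical. It also omits other trace-ending conditions, such as size limits or reaching an existing trace.

```python
HOT_THRESHOLD = 3  # assumption: head executions before recording starts

def select_traces(executed_blocks, threshold=HOT_THRESHOLD):
    """Return traces (lists of block addresses) chosen from a block stream.

    A block becomes a trace head when it is the target of a backward branch.
    Once a head's counter reaches the threshold, the blocks that execute next
    (the "next executing tail") are recorded as the trace, which ends at the
    next backward branch.
    """
    counters, traces, recording = {}, {}, None
    prev = None
    for block in executed_blocks:
        backward = prev is not None and block <= prev
        if recording is not None:
            if backward:  # the tail ends at a backward branch
                traces[recording[0]] = recording
                recording = None
            else:
                recording.append(block)
        if recording is None and backward and block not in traces:
            counters[block] = counters.get(block, 0) + 1
            if counters[block] >= threshold:
                recording = [block]  # start recording at the hot head
        prev = block
    return list(traces.values())
```

For a loop that executes blocks 10, 20, 30 repeatedly, the head 10 becomes hot on its third backward-branch arrival, and the single pass that follows is recorded as the trace. No path profiling beyond the head counter is needed.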


Incremental NET Trace Building

[Figure: step-by-step trace construction from blocks G, J, and K; panels labeled Basic Block Cache and Trace Cache]


Improvement #4: Trace Building

[Figure: START, basic block builder, trace selector, dispatch, context switch, and indirect branch lookup, with the BASIC BLOCK CACHE and the TRACE CACHE each holding non-control-flow instructions; the trace cache adds an "indirect branch stays on trace?" check]

Slowdown: 300x → 25x → 3x → 1.2x → 1.1x


Base Performance

[Figure: performance chart covering SPEC CPU2000, desktop, and server benchmarks]


Time Breakdown for SPECINT

[Figure: the system diagram (basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE, TRACE CACHE, indirect branch lookup) annotated with the share of execution time per component; the labels are < 1%, ~0%, ~2%, ~94%, and ~4%]


Sources of Overhead

Extra instructions

Indirect branch target comparisons

Indirect branch hashtable lookups

Extra data cache pressure

Indirect branch hashtable

Branch mispredictions

ret becomes jmp*

Application code modification


Not An Ordinary Application

An application executing in DynamoRIO’s code cache looks different from what the underlying hardware has been tuned for

The hardware expects:

Little or no dynamic code modification

Writes to code are expensive

call and ret instructions

Return Stack Buffer predictor


Performance Counter Data


Internals Outline

Efficient

Software code cache overview

Thread-shared code cache

Cache capacity limits

Data structures

Transparent

Comprehensive

Customizable


Threading Model

[Figure: threads Thread1 through ThreadN shown in each of three layers: Running Program, Code Caching Runtime System, Operating System]


Sharing Design Choices

Code cache

Basic blocks, traces, mixtures

Links between private and shared

Associated data structures

Pointers to private allowed inside shared?

Trace heads and profiling data

Indirect branch target lookup tables

Bruening et al. “Thread-Shared Software Code Caches” CGO’06


Thread-Private Advantages

Code cache eviction is immediate

No synchronization for most code cache or data structure operations

Absolute addressing for thread-local storage

No indirection for in-cache lookup tables

Thread-specific optimization and instrumentation


Thread-Private Disadvantages

Code cache and data structure duplication

Depends on code sharing among threads!

Desktop apps: not much is shared [Lee 1998, Bruening 2004]


Database and Web Server Suite

Benchmark   Server                                  Processes
ab low      IIS low isolation                       inetinfo.exe
ab med      IIS medium isolation                    inetinfo.exe, dllhost.exe
guest low   IIS low isolation, SQL Server 2000      inetinfo.exe, sqlservr.exe
guest med   IIS medium isolation, SQL Server 2000   inetinfo.exe, dllhost.exe, sqlservr.exe


Server Applications


Thread-Shared Code


Memory Impact

[Figure: memory impact chart for the ab med, guest low, and guest med benchmarks]


Performance Impact


Scalability Limit


Internals Outline

Efficient

Software code cache overview

Thread-shared code cache

Cache capacity limits

Data structures

Transparent

Comprehensive

Customizable


Added Memory Breakdown


Code Expansion

original code: 66%

net jumps: 8%

indirect branch target handling: 7%

exit stubs: 19%


Cache Capacity Challenges

How to set an upper limit on the cache size

Different applications have different working sets and different total code sizes

Which fragments to evict when that limit is reached

Without expensive profiling or extensive fragmentation

Bruening et al. “Maintaining Consistency and Bounding Capacity of Software Code Caches” CGO’05


Cache Capacity Settings

Thread-private:

Working set size matching is on by default

Client may see blocks or traces being deleted in the absence of any cache consistency event

Thread-shared:

Set to infinite size by default

Can enable capacity management via

-finite_shared_bb_cache

-finite_shared_trace_cache


Internals Outline

Efficient

Software code cache overview

Thread-shared code cache

Cache capacity limits

Data structures

Transparent

Comprehensive

Customizable


Two Modes of Code Cache Operation

Fine-grained scheme

Supports individual code fragment unlink and removal

Separate data structure per code fragment and each of its exits, memory regions spanned, and incoming links

Coarse-grained scheme

No individual code fragment control

Permanent intra-cache links

No per-fragment data structures at all

Treat entire cache as a unit for consistency


Data Structures

Fine-grained scheme

Data structures are highly tuned and compact

Coarse-grained scheme

There are no data structures

Savings on applications with large amounts of code are typically 15%-25% of committed memory and 5%-15% of working set


Status in Current Release

Fine-grained scheme

Current default

Coarse-grained scheme

Select with -opt_memory runtime option

Possible performance hit on certain benchmarks

In the future will be the default option

Required for persisted and process-shared caches


Adaptive Level of Granularity

Start with coarse-grain caches

Plus freezing and sharing/persisting

Switch to fine-grain for individual modules or sub-regions of modules after significant consistency events, to avoid expensive entire-module flushes

Support simultaneous fine-grain fragments within coarse-grain regions for corner cases

Match amount of bookkeeping to amount of code change

Majority of application code does not need fine-grain


Many Varieties of Code Caches

Coarse-grained versus fine-grained

Thread-shared versus thread-private

Basic blocks versus traces


Internals Outline

Efficient

Transparent

Rules of transparency

Cache consistency

Synchronization

Comprehensive

Customizable


Transparency

Do not want to interfere with the semantics of the program

Dangerous to make any assumptions about:

Register usage

Calling conventions

Stack layout

Memory/heap usage

I/O and other system call use


Painful, But Necessary

Difficult and costly to handle corner cases

Many applications will not notice…

…but some will!

Microsoft Office: Visual Basic generated code, stack convention violations

COM, Star Office, MMC: trampolines

Adobe Premiere: self-modifying code

VirtualDub: UPX-packed executable

etc.


Rule 1: Avoid Resource Conflicts

DynamoRIO system code executes at arbitrary points during application execution

If DynamoRIO uses the same library routine as the application, it may call that routine in the middle of the same routine being called by the application

Most library routines are not re-entrant!

Many are thread-safe, but that does not help us


Rule 1: Avoid Resource Conflicts

[Figure: two panels, labeled Linux and Windows]


Rule 2: If It’s Not Broken, Don’t Change It

Threads

Executable on disk

Application data

Including the stack!


Example Transparency Violation

[Figure: chart covering SPEC CPU2000, desktop, and server benchmarks, with ten entries marked "Error"]


Rule 3: If You Change It, Emulate Original Behavior’s Visible Effects

Application addresses

Address space

Error transparency

Code cache consistency


Internals Outline

Efficient

Transparent

Rules of transparency

Cache consistency

Synchronization

Comprehensive

Customizable


Code Change Mechanisms

[Figure: I-cache and D-cache, each holding code at A through D, shown once for RISC and once for x86]

RISC: Store B, Flush B, Jump B

x86: Store B, Jump B


How Often Does Code Change?

Not just modification of code!

Removal of code

Shared library unloading

Replacement of code

JIT region re-use

Trampoline on stack


Code Change Events

Benchmark    Memory Unmappings   Generated Code Regions   Modified Code Regions
SPECFP       112                 0                        0
SPECINT      29                  0                        0
SPECJVM      7                   3373                     4591
Excel        144                 21                       20
Photoshop    1168                40                       0
Powerpoint   367                 28                       33
Word         345                 20                       6


Detecting Code Removal

Example: shared library being unloaded

Requires explicit request by application to operating system

Detect by monitoring system calls (munmap, NtUnmapViewOfSection)


Detecting Code Modification

On x86, no explicit app request required, as the icache is kept consistent in hardware – so any memory write could modify code!

[Figure: x86 I-cache and D-cache, each holding code at A through D; sequence: Store B, Jump B]


Page Protection Plus Instrumentation

Invariant: application code copied to code cache must be read-only

If writable, hide read-only status from application

Some code cannot or should not be made read-only

Self-modifying code

Windows stack

Code on a page with frequently written data

Use per-fragment instrumentation to ensure code is not stale on entry and to catch self-modification


Adaptive Consistency Algorithm

Use page protection by default

Most code regions are always read-only

Subdivide written-to regions to reduce flushing cost of write-execute cycle

Large read-only regions, small written-to regions

Switch to instrumentation if write-execute cycle repeats too often (or on same page)

Switch back to page protection if writes decrease

Bruening et al. “Maintaining Consistency and Bounding Capacity of Software Code Caches” CGO’05
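The switching behavior described on this slide can be pictured as a small per-region state machine. The sketch below is only one possible reading: the class name `RegionConsistency`, the two thresholds, and the way "writes decrease" is measured (a run of executions with no intervening write) are all assumptions made for illustration, and region subdivision is left out.

```python
class RegionConsistency:
    """Tracks which consistency mechanism guards one code region."""

    def __init__(self, switch_after=3, revert_after=5):
        self.mode = "page_protection"     # the default named on the slide
        self.switch_after = switch_after  # assumption: cycles before switching
        self.revert_after = revert_after  # assumption: quiet runs before reverting
        self.cycles = 0                   # write-execute cycles under protection
        self.quiet_runs = 0               # executions with no write since the last one
        self.dirty = False                # written since the last execution?

    def on_write(self):
        """Record a write to the region."""
        self.dirty = True
        self.quiet_runs = 0

    def on_execute(self):
        """Record an execution of the region and return the mode now in force."""
        if self.dirty:
            self.dirty = False  # one write-execute cycle completed
            if self.mode == "page_protection":
                self.cycles += 1
                if self.cycles >= self.switch_after:
                    self.mode = "instrumentation"
                    self.cycles = 0
        elif self.mode == "instrumentation":
            self.quiet_runs += 1
            if self.quiet_runs >= self.revert_after:
                self.mode = "page_protection"
                self.quiet_runs = 0
        return self.mode
```

With the default thresholds, three write-execute cycles move a region to instrumentation, and five consecutive executions without a write move it back to page protection.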


Cache Eviction

Cache capacity management

Out of space

Upgraded to trace

Cache consistency management

Invalidate stale code


Remove Stale Code

Thread suspension, or cache barrier, to achieve thread-free code cache

Too expensive to use frequently

Must ensure that all threads that were in the cache at the time of unlinking have exited the code cache at least once


Non-Precise Flushing

Separate preventing entrance into stale code from actual removal

First step (prevent entrance):

Remove from indirect branch target tables (atomic w.r.t. in-cache lookup)

Unlink direct branches (also atomic if pad to avoid crossing cache lines)

Second step: removal

Same method as for capacity eviction


Timestamp and Reference Count

Global timestamp++ on each invalidation

Place set of invalidated blocks on a to-free list

List records global timestamp and thread count

Each thread stores the timestamp it last walked the to-free list

Exit cache: thread walks to-free list

Decrement refcount for each not-yet-seen entry

If refcount is 0, free the code blocks
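The bookkeeping listed on this slide can be sketched as follows. This is a single-threaded model under assumptions: the class and method names are hypothetical, every registered thread is assumed to be inside the code cache when an invalidation occurs, and the locking a real runtime would need around the shared list is omitted.

```python
class DelayedDeleter:
    """Frees invalidated blocks only after every thread has left the cache."""

    def __init__(self):
        self.timestamp = 0   # global timestamp, bumped on each invalidation
        self.to_free = []    # entries: timestamp, refcount, blocks
        self.last_seen = {}  # thread id -> timestamp when it last walked the list
        self.freed = []      # blocks whose memory has been released

    def register_thread(self, tid):
        """A new thread has seen everything invalidated so far."""
        self.last_seen[tid] = self.timestamp

    def invalidate(self, blocks):
        """Queue already-unlinked blocks, recording timestamp and thread count."""
        self.timestamp += 1
        self.to_free.append({"timestamp": self.timestamp,
                             "refcount": len(self.last_seen),
                             "blocks": list(blocks)})

    def on_cache_exit(self, tid):
        """Walk the to-free list when thread `tid` exits the code cache."""
        seen = self.last_seen[tid]
        remaining = []
        for entry in self.to_free:
            if entry["timestamp"] > seen:  # not yet seen by this thread
                entry["refcount"] -= 1
            if entry["refcount"] == 0:
                self.freed.extend(entry["blocks"])
            else:
                remaining.append(entry)
        self.to_free = remaining
        self.last_seen[tid] = self.timestamp
```

The per-thread timestamp is what keeps a thread that exits the cache twice from decrementing the same entry twice.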


Internals Outline

Efficient

Transparent

Rules of transparency

Cache consistency

Synchronization

Comprehensive

Customizable


Synchronization Transparency

Application thread management should not interfere with the runtime system, and vice versa

Cannot allow the app to suspend a thread holding a runtime system lock

Runtime system cannot use app locks


Code Cache Invariant

App thread suspension requires safe spots where no runtime system locks are held

Time spent in the code cache can be unbounded

Our invariant: no runtime system lock can be held while executing in the code cache


Internals Outline

Efficient

Transparent

Comprehensive

Threads

Kernel-mediated control transfers

System calls

Customizable


Above the Operating System


Operating System Interposition

Transparency

Threads

Scratch space

Thread-local state

Synchronization

Kernel-mediated control transfers

System calls

Injection


Complications from Threads

Data structure synchronization

Code cache management

Consistency

Eviction

Application thread synchronization

In-cache lookup tables

Thread-local storage


Thread-Local Storage (TLS)

Absolute addressing

Thread-private only

Application stack

Not reliable or transparent

Stolen register

Performance hit

Segment


Synchronization Optimizations

Finer-grained synchronization

Trace building that combines efficient private construction with shared results

In-cache lock-free lookup table access in the presence of entry invalidations

Delayed deletion algorithm based on timestamps and reference counts


Internals Outline

Efficient

Transparent

Comprehensive

Threads

Kernel-mediated control transfers

System calls

Customizable


Kernel-Mediated Control Transfers

[Figure: timeline across user mode and kernel mode. Labels: message handler; message pending, save user context; no message pending, restore context]

Majority of executed code in a typical Windows application


Challenge #1: Interception

Known solution:

Set up our own handler in place of the application’s


Intercepting Linux Signals

Register our own signal handler

[Figure: timeline across user mode and kernel mode. Labels: DynamoRIO handler; signal handler; signal pending, save user context; no signal pending, restore context]


Windows Messages

[Figure: timeline across user mode and kernel mode. Labels: dispatcher; message handler; message pending, save user context; no message pending, restore context]


Intercepting Windows Messages

Modify shared library memory image

[Figure: timeline across user mode and kernel mode. Labels: dispatcher; message handler; message pending, save user context; no message pending, restore context]


Challenge #2: Continuation

No state across control transfer is best…

May not return to suspended context!

…unless context is stored in kernel

May not know where suspension point is!

[Figure: block cache containing blocks A through L]


Callback Suspension Points

Windows messages are only delivered during certain alertable system calls

Can perform all system calls in a unique piece of code, instead of in fragments

[Figure: basic blocks A through L in the block cache, plus a unique shared system call point in DynamoRIO.]

Page 92: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Challenge #3: Self-Interruption

Linux signals can be delivered at any time

Fortunately, state is in user mode

If DynamoRIO itself interrupted, must queue signal for later delivery

Must block all signals while in DynamoRIO’s handler to avoid memory allocation issues
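The queue-for-later-delivery rule above can be sketched as a small simulation. This is not DynamoRIO's implementation: the names `sim_t`, `on_signal`, and `leave_engine` and the fixed-size queue are assumptions made for illustration, and no real signals are raised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical per-thread state for the simulation. */
typedef struct {
    bool in_engine;              /* true while engine (not app) code runs */
    int pending[MAX_PENDING];    /* signals queued while in the engine */
    size_t num_pending;
    int delivered[MAX_PENDING];  /* signals handed to the app, in order */
    size_t num_delivered;
} sim_t;

/* Hand one signal to the application handler (here: just record it). */
static void deliver_to_app(sim_t *s, int sig) {
    assert(s->num_delivered < MAX_PENDING);
    s->delivered[s->num_delivered++] = sig;
}

/* Called when a signal arrives: queue it if the engine was interrupted. */
static void on_signal(sim_t *s, int sig) {
    if (s->in_engine) {
        assert(s->num_pending < MAX_PENDING);
        s->pending[s->num_pending++] = sig;
    } else {
        deliver_to_app(s, sig);
    }
}

/* Called when the engine returns to application code (a safe point):
 * drain the queue in arrival order. */
static void leave_engine(sim_t *s) {
    s->in_engine = false;
    for (size_t i = 0; i < s->num_pending; i++)
        deliver_to_app(s, s->pending[i]);
    s->num_pending = 0;
}
```

A fixed-size array is used here so that the handler path never allocates memory, which is in the spirit of the allocation concern noted on the slide.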

Page 93: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Challenge #4: Kernel Emulation

Exception and signal handlers are passed machine context of the faulting instruction

For transparency, that context must be translated from the code cache to the original code location

[Figure: a user context and its faulting instruction, shown twice.]

Page 94: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Internals Outline

Efficient

Transparent

Comprehensive

Threads

Kernel-mediated control transfers

System calls

Customizable

Page 95: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Must Monitor System Calls

To maintain control:

Calls that affect the flow of control: register signal handler, create thread, set thread context, etc.

To maintain transparency:

Queries of modified state app should not see

To maintain cache consistency:

Calls that affect the address space

To support cache eviction:

Interruptible system calls must be redirected

Page 96: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Operating System Dependencies

System calls and their numbers

Monitor application’s usage, as well as for our own resource management

Windows changes the numbers each major release

Details of kernel-mediated control flow

Must emulate how kernel delivers events

Initial injection

Once in, follow child processes

Page 97: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Internals Outline

Efficient

Transparent

Comprehensive

Customizable

Page 98: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Clients

The engine exports an API for building a client

System details abstracted away: client focuses on manipulating the code stream

Page 99: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Events

[Figure: DynamoRIO execution flow. Labels: START, basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE (non-control-flow instructions), TRACE CACHE (non-control-flow instructions), indirect branch lookup, "indirect branch stays on trace?". Three points in the flow are labeled "client".]

Page 100: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett

Examples: Part I

Page 101: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett

DynamoRIO API

Page 102: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Clients

The engine exports an API for building a client

System details abstracted away: client focuses on manipulating the code stream

Page 103: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Events

[Figure: DynamoRIO execution flow. Labels: START, basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE (non-control-flow instructions), TRACE CACHE (non-control-flow instructions), indirect branch lookup, "indirect branch stays on trace?". Three points in the flow are labeled "client".]

Page 104: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Events: Code Stream

Client has opportunity to inspect and potentially modify every single application instruction, immediately before it executes

Entire application code stream

Basic block creation event: can modify the block

For comprehensive instrumentation tools

Or, focus on hot code only

Trace creation event: can modify the trace

Custom trace creation: can determine trace end condition

For optimization and profiling tools

Page 105: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Simplifying Client View

Several optimizations disabled

Elision of unconditional branches

Indirect call to direct call conversion

Shared cache sizing

Process-shared and persistent code caches

Future release will give client control over optimizations

Page 106: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Events: Application Actions

Application thread creation and deletion

Application library load and unload

Application fault or exception

Currently Windows only

Page 107: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Events: Bookkeeping

Initialization and Exit

Entire process

Each thread

Child of fork (Linux-only)

Basic block and trace deletion during cache management

Nudge received

Used for communication into client (currently Windows only)

Page 108: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: General Utilities

Transparency support

Separate memory allocation and I/O

Alternate stack

Thread support

Thread-local memory

Simple mutexes

Communication

Nudges

File creation, reading, and writing

Page 109: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: General Utilities, Cont’d

Application inspection

Address space querying

Module iterator

Processor feature identification

Third-party library support

If transparency is maintained!

-wrap support on Linux

ntdll.dll link support on Windows

Page 110: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: Instruction Representation

Full IA-32 instruction representation

Instruction creation with auto-implicit-operands

Operand iteration

Instruction lists with iteration, insertion, removal

Decoding at various levels of detail

Encoding

Page 111: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Instruction Representation

raw bytes            opcode  operands                    eflags

8d 34 01             lea     (%ecx,%eax,1) -> %esi       -

8b 46 0c             mov     0xc(%esi) -> %eax           -

2b 46 1c             sub     0x1c(%esi) %eax -> %eax     WCPAZSO

0f b7 4e 08          movzx   0x8(%esi) -> %ecx           -

c1 e1 07             shl     $0x07 %ecx -> %ecx          WCPAZSO

3b c1                cmp     %eax %ecx                   WCPAZSO

0f 8d a2 0a 00 00    jnl     $0x77f52269                 RSO

Page 112: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Instruction Representation

raw bytes            opcode  operands                    eflags

(none)               lea     (%ecx,%eax,1) -> %edi       -

(none)               mov     0xc(%edi) -> %eax           -

(none)               sub     0x1c(%edi) %eax -> %eax     WCPAZSO

(none)               movzx   0x8(%edi) -> %ecx           -

c1 e1 07             shl     $0x07 %ecx -> %ecx          WCPAZSO

3b c1                cmp     %eax %ecx                   WCPAZSO

0f 8d a2 0a 00 00    jnl     $0x77f52269                 RSO

Page 113: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Linear Control Flow

Both basic blocks and traces are linear

Instruction sequences are all single-entrance, multiple-exit

Greatly simplifies analysis algorithms
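As one illustration of why linearity helps, finding a point where the arithmetic flags are dead becomes a single forward scan with no control-flow graph. The sketch below uses a hypothetical, simplified `insn_t` summary rather than DynamoRIO's instruction type; the flag masks are assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-instruction summary (not DynamoRIO's instr_t). */
typedef struct {
    unsigned reads_flags;   /* bitmask of arithmetic flags read */
    unsigned writes_flags;  /* bitmask of arithmetic flags written */
} insn_t;

#define ALL_ARITH_FLAGS 0x3Fu  /* six flags: C, P, A, Z, S, O */

/* Returns the index of the first instruction that overwrites all six
 * arithmetic flags without reading any of them, or -1 if there is none.
 * Code inserted immediately before that instruction may clobber the
 * flags freely, because the instruction overwrites them anyway. */
static int find_dead_flags_point(const insn_t *code, size_t n) {
    assert(code != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bool writes_all =
            (code[i].writes_flags & ALL_ARITH_FLAGS) == ALL_ARITH_FLAGS;
        bool reads_any = (code[i].reads_flags & ALL_ARITH_FLAGS) != 0;
        if (writes_all && !reads_any)
            return (int)i;
    }
    return -1;
}
```

When the scan returns -1, a tool would have to save and restore the flags around its inserted code instead.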

Page 114: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: 64-Bit

32-bit build of DynamoRIO only handles 32-bit code

64-bit build of DynamoRIO decodes/encodes both 32-bit and 64-bit code

Current release does not support executing applications that mix the two

IR is universal: covers both 32-bit and 64-bit

Abstracts away underlying mode

Page 115: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: 64-Bit

When going to or from the IR, the thread mode and instruction mode determine how instrs are interpreted

When decoding, current thread’s mode is used

Default is 64-bit for 64-bit DynamoRIO

Can be changed with set_x86_mode()

When encoding, that instruction’s mode is used

When created, set to mode of current thread

Can be changed with instr_set_x86_mode()

Page 116: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: 64-Bit

Define X86_64 before including header files when building a 64-bit client

Convenience macros for printf formats, etc. are provided

E.g.:

printf("Pointer is "PFX"\n", p);
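To show how a pointer-format convenience macro of this kind can work, here is a minimal standalone sketch. `MY_PFX` is a hypothetical stand-in written for this example; it is not DynamoRIO's definition of `PFX`, whose exact expansion is not given in the slides.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical macro: pick a zero-padded hex width that matches the
 * pointer size of the build (16 digits for 64-bit, 8 for 32-bit). */
#if UINTPTR_MAX > 0xFFFFFFFFu
# define MY_PFX "0x%016" PRIxPTR
#else
# define MY_PFX "0x%08" PRIxPTR
#endif

/* Formats "Pointer is 0x...\n" into buf and returns snprintf's result. */
static int format_pointer(char *buf, size_t size, const void *p) {
    assert(buf != NULL && size > 0);
    /* Adjacent string literals are concatenated at compile time, so the
     * same source line works for both pointer sizes. */
    return snprintf(buf, size, "Pointer is " MY_PFX "\n", (uintptr_t)p);
}
```

The point of such a macro is that client source stays identical across 32-bit and 64-bit builds; only the macro's expansion changes.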

Page 117: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: Code Manipulation

Meta-instructions

State preservation

Eflags, arith flags, floating-point state, MMX/SSE state

Spill slots, TLS

Clean calls to C code

Dynamic instrumentation

Replace code in the code cache

Branch instrumentation

Convenience routines

Page 118: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: Flushing the Cache

Immediately deleting or replacing individual code cache fragments is available for thread-private caches

Only removes from that thread’s cache

Two basic types of thread-shared flush:

Non-precise: remove all entry points but let target cache code be invalidated and freed lazily

Precise/synchronous:

Suspend the world

Relocate threads inside the target cache code

Invalidate and free the target code immediately

Page 119: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: Flushing the Cache

Thread-shared flush API routines:

dr_unlink_flush_region(): non-precise flush

dr_flush_region(): synchronous flush

dr_delay_flush_region():

If a completion callback is provided, synchronous

Without a callback, non-precise
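A toy model may make the difference between the two flush types concrete. It does not call the DynamoRIO API: `fragment_t` and the three functions are hypothetical names, and a single counter stands in for the threads currently executing inside the target code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one region of code cache. */
typedef struct {
    bool linked;         /* entry points still reachable */
    bool freed;          /* memory has been released */
    int threads_inside;  /* threads currently executing in the region */
} fragment_t;

/* Non-precise flush: remove entry points now, free lazily. */
static void flush_nonprecise(fragment_t *f) {
    f->linked = false;
    if (f->threads_inside == 0)
        f->freed = true;
}

/* A thread leaves the region on its own; the last one out triggers
 * the lazy free if the region was already unlinked. */
static void thread_exits(fragment_t *f) {
    assert(f->threads_inside > 0);
    f->threads_inside--;
    if (!f->linked && f->threads_inside == 0)
        f->freed = true;
}

/* Synchronous flush: with all threads suspended, relocate any that are
 * inside the region and free it immediately. Returns how many threads
 * had to be relocated. */
static int flush_synchronous(fragment_t *f) {
    int relocated = f->threads_inside;
    f->linked = false;
    f->threads_inside = 0;
    f->freed = true;
    return relocated;
}
```

In this model the non-precise flush never touches running threads, at the cost of the stale code surviving until the last thread leaves; the synchronous flush pays for suspension and relocation to get an immediate guarantee.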

Page 120: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: Translation

Translation refers to the mapping of a code cache machine state (program counter, registers, and memory) to its corresponding application state

The program counter always needs to be translated

Registers and memory may also need to be translated depending on the transformations applied when copying into the code cache

Page 121: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Translation Case 1: Fault

Exception and signal handlers are passed machine context of the faulting instruction.

For transparency, that context must be translated from the code cache to the original code location

Translated location should be where the application would have had the fault or where execution should be resumed

[Figure: a user context and its faulting instruction, shown twice.]

Page 122: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Translation Case 2: Relocation

If one application thread suspends another, or DynamoRIO suspends all threads for a synchronous cache flush:

Need suspended target thread in a safe spot

Not always practical to wait for it to arrive at a safe spot (if in a system call, e.g.)

DynamoRIO forcibly relocates the thread

Must translate its state to the proper application state at which to resume execution

Page 123: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Translation Approaches

Two approaches to program counter translation:

Store mappings generated during fragment building

High memory overhead (> 20% for some applications, because it prevents internal storage optimizations) even with highly optimized difference-based encoding. Costly for something rarely used.

Re-create mapping on-demand from original application code

Cache consistency guarantees mean the corresponding application code is unchanged

Requires idempotent code transformations

DynamoRIO supports both approaches

The engine mostly uses the on-demand approach, but stored mappings are occasionally needed

Page 124: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Instruction Translation Field

Each instruction contains a translation field

Holds the application address that the instruction corresponds to

Set via instr_set_translation()

Page 125: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Context Translation Via Re-Creation

C1: mov %ebx, %ecx

C2: add %eax, (%ecx)

C3: cmp $4, (%eax)

C4: jle <stub0>

C5: jmp <stub1>

A1: mov %ebx, %ecx

A2: add %eax, (%ecx)

A3: cmp $4, (%eax)

A4: jle 710349fb

D1: (A1) mov %ebx, %ecx

D2: (A2) add %eax, (%ecx)

D3: (A3) cmp $4, (%eax)

D4: (A4) jle <stub0>

D5: (A4) jmp <stub1>
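The listing above suggests how program counter translation by re-creation can proceed: rebuild the instruction list (the D entries, each tagged with its application address), then walk it while tracking the code cache address until the faulting cache address is reached. The sketch below is a simplified model under that reading; `recreated_t` and `translate_pc` are hypothetical names, and the instruction lengths are supplied by the caller rather than produced by a real encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical re-created instruction: its encoded length in the code
 * cache and the application address stored in its translation field. */
typedef struct {
    unsigned length;
    uintptr_t translation;
} recreated_t;

/* Walks the re-created list from the fragment's start address and
 * returns the application address for cache_pc. Returns 0 if cache_pc
 * is not the start of an instruction in this fragment. */
static uintptr_t translate_pc(const recreated_t *ilist, size_t n,
                              uintptr_t frag_start, uintptr_t cache_pc) {
    assert(ilist != NULL || n == 0);
    uintptr_t pc = frag_start;
    for (size_t i = 0; i < n; i++) {
        if (pc == cache_pc)
            return ilist[i].translation;
        pc += ilist[i].length;
    }
    return 0;
}
```

Note that two cache instructions can share one translation, as D4 and D5 both map to A4 in the listing; the walk handles that without special cases.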

Page 126: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Meta vs. Non-Meta Instructions

Non-Meta instructions are treated as application instructions

They must have translations

Control-flow-changing instructions are modified so that DynamoRIO retains control and their targets are brought into the code cache

Meta instructions are added instrumentation code

Not treated as part of the application (e.g., calls run natively)

Do not require translations, unless they may potentially fault (by referencing app memory, e.g.)

See instr_set_ok_to_mangle()

Page 127: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Translation Support

Instruction lists passed to clients are annotated with translation information

Read via instr_get_translation()

Clients are free to delete instructions, change instructions and their translations, and add new meta and non-meta instructions (see dr_register_bb_event() for restrictions)

An idempotent client that restricts itself to deleting app instructions and adding non-faulting meta instructions can ignore translation concerns

DynamoRIO takes care of instructions added by API routines (insert_clean_call(), etc.)

Clients can choose between storing or regenerating translations on a fragment by fragment basis.

Page 128: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Regenerated Translations

Client returns DR_EMIT_DEFAULT from its bb or trace event callback

Client bb & trace event callbacks are re-called when translations are needed with translating==true

Client must exactly duplicate transformations performed when the block was generated

Client must set translation field for all added non-meta instructions and all potentially faulting meta instructions

This is true even if translating==false since DynamoRIO may decide it needs to store translations anyway

Page 129: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Stored Translations

Client returns DR_EMIT_STORE_TRANSLATIONS from its bb or trace event callback

Client must set translation field for all added non-meta instructions and all potentially faulting meta instructions

Client bb or trace hook will not be re-called with translating==true

Page 130: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Register State Translation

Translation may be needed at a point where some registers are spilled to memory

During indirect branch or RIP-relative mangling, e.g.

DynamoRIO walks fragment up to translation point, tracking register spills and restores

Special handling for stack pointer around indirect calls and returns

DynamoRIO tracks client spills and restores added by API routines

Clean calls, etc.
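The fragment walk described above can be modeled in a few lines. This is an illustrative sketch only: the operation kinds, `context_t`, and `restore_app_registers` are hypothetical, and real fragments carry far more state than a spill/restore tag.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_REGS = 8, NUM_SLOTS = 4 };

typedef enum { OP_OTHER, OP_SPILL, OP_RESTORE } op_kind_t;

/* Hypothetical fragment entry: a spill or restore of `reg` using `slot`,
 * or any other instruction (reg and slot ignored). */
typedef struct {
    op_kind_t kind;
    int reg;
    int slot;
} op_t;

typedef struct {
    long regs[NUM_REGS];
} context_t;

/* Walks frag[0..stop) (the instructions already executed at the
 * translation point), tracking which registers are currently spilled,
 * then copies each spilled register's application value from its slot
 * back into the context. */
static void restore_app_registers(const op_t *frag, size_t stop,
                                  const long *slots, context_t *ctx) {
    int spilled_to[NUM_REGS];
    for (int r = 0; r < NUM_REGS; r++)
        spilled_to[r] = -1;  /* -1 means the register holds its app value */
    for (size_t i = 0; i < stop; i++) {
        if (frag[i].kind == OP_OTHER)
            continue;
        assert(frag[i].reg >= 0 && frag[i].reg < NUM_REGS);
        assert(frag[i].slot >= 0 && frag[i].slot < NUM_SLOTS);
        spilled_to[frag[i].reg] =
            (frag[i].kind == OP_SPILL) ? frag[i].slot : -1;
    }
    for (int r = 0; r < NUM_REGS; r++) {
        if (spilled_to[r] >= 0)
            ctx->regs[r] = slots[spilled_to[r]];
    }
}
```

The same walk gives different answers depending on where the translation point falls, which is why the tracking has to stop exactly there.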

Page 131: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Client Register State Translation

If a client adds its own register spilling/restoring code or changes register mappings it must register for the restore state event to correct the context

The same event can also be used to fix up the application’s view of memory

DynamoRIO does not internally store this kind of translation information ahead of time when the fragment is built

The client must maintain its own data structures

Page 132: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: Multiple Clients

It is each client's responsibility to ensure compatibility with other clients

Instruction stream modifications made by one client are visible to other clients

At client registration each client is given a priority

dr_init() called in priority order (priority 0 called first and thus registers its callbacks first)

Event callbacks called in reverse order of registration

Gives precedence to first registered callback, which is given the final opportunity to modify the instruction stream or influence DynamoRIO's operation
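The ordering rule above (initialize in priority order, invoke callbacks in reverse order of registration) can be sketched with a small registry. The types and function names here are hypothetical and exist only to illustrate the ordering; they are not part of the DynamoRIO API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

/* Hypothetical registry: client ids in the order they registered. */
typedef struct {
    int client_ids[MAX_CALLBACKS];
    size_t count;
} registry_t;

/* Records a callback registration for the given client. */
static void register_callback(registry_t *r, int client_id) {
    assert(r->count < MAX_CALLBACKS);
    r->client_ids[r->count++] = client_id;
}

/* Fills order[] with client ids in the order their callbacks would be
 * invoked for one event: the reverse of registration order. Returns the
 * number of entries written. */
static size_t invocation_order(const registry_t *r, int *order) {
    for (size_t i = 0; i < r->count; i++)
        order[i] = r->client_ids[r->count - 1 - i];
    return r->count;
}
```

With clients of priority 0, 1, and 2 registering in that order, the callbacks run for clients 2, 1, then 0, so the priority-0 client sees the changes made by the others and has the last word.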

Page 133: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO API: Non-Standard Deployment

Standalone API

Use DynamoRIO as a library of IA-32/AMD64 manipulation routines

Start/Stop API

Can annotate source code to mark where DynamoRIO should control the application

Page 134: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO versus Pin

Basic interface is fundamentally different

Pin = insert callout/trampoline only

Not so different from tools that modify the original code: Dyninst, Vulcan, Detours

Uses code cache only for transparency

DynamoRIO = arbitrary code stream modifications

Only feasible with a code cache

Takes full advantage of power of code cache

Page 135: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


DynamoRIO versus Pin

Pin = insert callout/trampoline only

Pin tries to inline and optimize

Client has little control or guarantee over final performance

DynamoRIO = arbitrary code stream modifications

Client has full control over all inserted instrumentation

Result can be significant performance difference

PiPA Memory Profiler + Cache Simulator: 3.27x speedup w/ DynamoRIO vs 2.6x w/ Pin

Page 136: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Base Performance Comparison (No Tool)

[Chart; data labels shown: 171%, 121%]

Page 137: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Base Memory Comparison

[Chart; data labels shown: 15MB, 2.6MB, 44MB, 3.5MB]

Page 138: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


BBCount Pin Tool

static int bbcount;

VOID PIN_FAST_ANALYSIS_CALL docount() { bbcount++; }

VOID Trace(TRACE trace, VOID *v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
        BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(docount),
                       IARG_FAST_ANALYSIS_CALL, IARG_END);
    }
}

int main(int argc, CHAR *argv[]) {
    PIN_InitSymbols();
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_StartProgram();
    return 0;
}

Page 139: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


BBCount DynamoRIO Tool

#include "dr_api.h"

static int global_count;

static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
                  bool for_trace, bool translating)
{
    instr_t *instr, *first = instrlist_first(bb);
    uint flags;
    /* Our inc can go anywhere, so find a spot where flags are dead. */
    for (instr = first; instr != NULL; instr = instr_get_next(instr)) {
        flags = instr_get_arith_flags(instr);
        /* OP_inc doesn't write CF but not worth distinguishing */
        if (TESTALL(EFLAGS_WRITE_6, flags) && !TESTANY(EFLAGS_READ_6, flags))
            break;
    }
    /* No dead-flags spot found: preserve the flags around the inc. */
    if (instr == NULL)
        dr_save_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    instrlist_meta_preinsert(bb, (instr == NULL) ? first : instr,
        INSTR_CREATE_inc(drcontext,
            OPND_CREATE_ABSMEM((byte *)&global_count, OPSZ_4)));
    if (instr == NULL)
        dr_restore_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void dr_init(client_id_t id)
{
    dr_register_bb_event(event_basic_block);
}

Page 140: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


BBCount: Pin Inlining Importance

[Chart; data labels shown: 569%, 231%]

Page 141: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


BBCount Performance Comparison

[Chart; data labels shown: 226%, 185%, 233%]

Page 142: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Changes From Prior Releases

Not backward compatible with 0.9.1-0.9.5

Client sees original application code in bb event

No mangling/eliding at all

Event registration model (vs client exports)

IR simplification (levels are not exposed)

Easier-to-use clean calls

Multiple clients

Module iteration and memory querying

Library support (-wrap, ntdll.dll)

Page 143: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Changes From Prior Releases, Cont’d

Infrastructure improvements

64-bit support

Thread-shared caches (can still request thread-private)

Access to tls slots

Removed features you can implement yourself:

Edge-counting profile build

Custom exit stubs and prefixes

Full API documented in html and pdf

Numerous type and routine name changes

Page 144: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett

Examples: Part 2

Page 145: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett

Feedback

Page 146: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Future Releases

Persistent and process-shared caches

Auto-inline callouts

Client threads

Attach to a process

Full control over traces

Symbol table lookup support

Add signal event for Linux clients

Add syscall pre/post events and param access routines

Trace post-mangling last-shot event (like 0.9.4 trace event)

Further C++/3rd party library support

64-bit client controlling 32-bit app

Your suggestion here

Page 147: Building Dynamic Instrumentation Tools With DynamoRIO Derek Bruening Timothy Garnett


Feedback

Questions for you

Feedback on what you want to see in API

Thank you!