Catenation and Operand Specialization for Tcl VM Performance
Benjamin Vitale¹, Tarek Abdelrahman, U. Toronto (¹now with OANDA Corp.)


TRANSCRIPT

Page 1: Catenation and Operand Specialization  For Tcl VM Performance

Catenation and Operand Specialization

For Tcl VM Performance

Benjamin Vitale¹, Tarek Abdelrahman

U. Toronto

¹Now with OANDA Corp.

Page 2

Preview

• Techniques for fast interpretation (of Tcl)

• Or slower, too!

• Lightweight compilation; a point between interpreters and JITs

• Unwarranted chumminess with Sparc assembly language

Page 3

[Figure: implementation overview: Tcl source → Tcl bytecode compiler → JIT compiler → native machine code. The JIT performs catenation, specialization, and linking, using templates extracted and patches synthesized at runtime initialization.]

Page 4

Outline

• VM dispatch overhead

• Techniques for removing overhead, and consequences

• Evaluation

• Conclusions

Page 5

Interpreter Speed

• What makes interpreters slow?

• One problem is dispatch overhead:
  – interpreter core is a dispatch loop
  – probably much smaller than the run-time system
  – yet the focus of considerable interest
  – simple, elegant, fertile

Page 6

Typical Dispatch Loop

for (;;) {
    opcode = *vpc++;          /* fetch opcode */
    switch (opcode) {         /* dispatch */

    case INST_DUP:            /* real work: */
        obj = *stack_top;
        *++stack_top = obj;
        break;

    case INST_INCR:
        arg = *vpc++;         /* fetch operand */
        *stack_top += arg;
        break;
    }
}

Page 7

Dispatch Overhead

Execution time of Tcl INST_PUSH:

                  Cycles   Instructions
  Real work         ~4          5
  Operand fetch     ~6          6
  Dispatch          19         10

Page 8

Goal: Reduce Dispatch

  Dispatch technique                         SPARC cycle time
  for/switch                                       19
  token threaded, decentralized next               14
  direct threaded, decentralized next              10
  selective inlining (average)
    [Piumarta & Riccardi, PLDI '98]               <<10
  ?                                                 0

Page 9

Native Code the Easy Way

• To eliminate all dispatch, we must execute native code

• But, we’re lazy hackers

• Instead of writing a real code generator, use the interpreter as the source of templates

Page 10

Interpreter Structure

Virtual program:    push 0; push 1; add

Instruction table:  inst_push, inst_foo, inst_add → addresses of native bodies

Interpreter native code:
  body of push + dispatch
  body of foo  + dispatch
  body of add  + dispatch

• The dispatch loop sequences bodies according to the virtual program

Page 11

Catenation

Virtual program:          "Compiled" native code:
  push 0                    copy of code for inst_push
  push 1                    copy of code for inst_push
  add                       copy of code for inst_add

• No dispatch required

• Control falls through naturally

Page 12

Opportunities

• Catenated code has a nice feature

• A normal interpreter has one generic implementation for each opcode

• Catenated code has separate copies

• This yields opportunities for further optimization. However…

Page 13

Challenges

• Code is not meant to be moved after linking

• For example, some pc-relative instructions are hard to move, including some branches and function calls

• But first, the good news

Page 14

Exploiting Catenated Code

• Separate bodies for each opcode yield three nice opportunities

1. Convert virtual branches to native

2. Remove virtual program counter

3. Reduce operand fetch code to runtime constants

Page 15

Virtual Branches Become Native

bytecode:              native code:
  L1: dec_var x          L2: code for dec_var
      push 0                 code for push
      cmp                    code for cmp
      bz(virtual) L1         bz(native) L2
      done                   code for done

• Arrange for the cmp body to set the condition code

• Emit synthesized code for bz; don't memcpy
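Converting a virtual branch to a native one needs a map from bytecode offsets to offsets in the catenated native buffer. A minimal sketch in C, with invented names (native_off, branch_disp); the paper's actual emitter synthesizes SPARC branch instructions from this arithmetic:

```c
#include <assert.h>
#include <stddef.h>

/* While catenating, record where each bytecode instruction's template
 * begins in the native buffer.  A virtual branch to bytecode offset T
 * then becomes a native branch to native_off[T]. */
size_t native_target(const size_t *native_off, size_t bc_target)
{
    return native_off[bc_target];
}

/* Signed displacement for a pc-relative native branch emitted at native
 * offset 'site': displacement = target - site. */
ptrdiff_t branch_disp(const size_t *native_off, size_t bc_target, size_t site)
{
    return (ptrdiff_t) native_off[bc_target] - (ptrdiff_t) site;
}
```

Backward branches can be resolved during the copy pass; forward ones need the total-length pass (or a fixup list) first.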

Page 16

Eliminating Virtual PC

• vpc is used by dispatch, operand fetch, virtual branches – and exception handling

• Remove code to maintain vpc, and free up register

• For exceptions, we rematerialize vpc

Page 17

Rematerializing vpc

Bytecode:             Native code:
  1: inc_var 1          code for inc_var
  3: push 0               if (err) { vpc = 1; br exception }
  5: inc_var 2          code for push
                        code for inc_var
                          if (err) { vpc = 5; br exception }

• Separate copies: in the copy for the instruction at vpc 1, set vpc = 1

Page 18

Moving Immovable Code

• pc-relative instructions can break:

7000: call +2000 (9000 <printf>)     original code

3000: call +2000 (5000 <????>)       after moving: the target is now wrong

Page 19

Patching Relocated Code

• Patch pc-relative instructions so they work:

3000: call +2000 (5000 <????>)       before patching

3000: call +6000 (9000 <printf>)     after patching
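The patch is pure displacement arithmetic: the absolute target is recovered from the old site, then re-expressed relative to the new site. A small sketch (the function name rebias_disp is invented):

```c
#include <assert.h>
#include <stdint.h>

/* A pc-relative call encodes target - site.  After copying the
 * instruction from old_site to new_site, rebias the displacement so it
 * still reaches the same absolute target. */
int32_t rebias_disp(uint32_t old_site, uint32_t new_site, int32_t old_disp)
{
    uint32_t target = old_site + (uint32_t) old_disp; /* e.g. printf */
    return (int32_t) (target - new_site);             /* new displacement */
}
```

Using the slide's numbers: a call at 0x7000 with displacement +0x2000 targets 0x9000; moved to 0x3000, its displacement must become +0x6000.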

Page 20

Patches

• Input types:  ARG, LITERAL, BUILTIN_FUNC, JUMP, PC, NONE

• Output types: SIMM13, SETHI/OR, JUMP, CALL

• Objects describing change to code

• Input: Type, position, and size of operand in bytecode instruction

• Output: Type and offset of instruction in template

• Only 4 output types on Sparc!
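A hedged sketch of what such a patch object might look like in C. The struct and field names are invented; the SIMM13 helper reflects the SPARC format, where signed immediates occupy the low 13 bits of the instruction word:

```c
#include <assert.h>
#include <stdint.h>

/* Input side locates the operand in the bytecode instruction; output
 * side names the template instruction to rewrite (per the slide). */
enum in_type  { IN_ARG, IN_LITERAL, IN_BUILTIN_FUNC, IN_JUMP, IN_PC, IN_NONE };
enum out_type { OUT_SIMM13, OUT_SETHI_OR, OUT_JUMP, OUT_CALL };

struct patch {
    enum in_type  in;       /* what kind of operand to read           */
    int           in_off;   /* operand position in the bytecode insn  */
    enum out_type out;      /* how to rewrite the template            */
    int           out_off;  /* instruction offset within the template */
};

/* Apply a SIMM13 patch: splice a value into the low 13 bits. */
uint32_t apply_simm13(uint32_t insn, int32_t value)
{
    return (insn & ~0x1fffu) | ((uint32_t) value & 0x1fffu);
}
```

A JIT emitter then just walks each template's patch list and applies each output type in turn.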

Page 21

Interpreter Operand Fetch

add  l5, 1, l5        increment vpc to operand
ldub [l5], o0         load operand from bytecode stream
ld   [fp+48], o2      get bytecode object addr from C stack
ld   [o2+4c], o1      get literal tbl addr from bytecode obj
sll  o0, 2, o0        compute offset into literal table
ld   [o1+o0], o1      load from literal table

[Figure: bytecode instruction "push 1" (opcode 00, operand 01) indexing the literal table]

Page 22

Operand Specialization

Before (generic operand fetch):
  add  l5, 1, l5
  ldub [l5], o0
  ld   [fp+48], o2
  ld   [o2+4c], o1
  sll  o0, 2, o0
  ld   [o1+o0], o1

After (specialized):
  sethi o1, hi(obj_addr)
  or    o1, lo(obj_addr)

• The array load becomes a constant

• Patch input: one-byte integer literal at offset 1

• Patch output: sethi/or at offset 0
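The sethi/or pair materializes a 32-bit constant from a 22-bit high part and a 10-bit low part. A sketch of the split, with invented helper names; this is the arithmetic a SETHI/OR output patch performs:

```c
#include <assert.h>
#include <stdint.h>

/* SPARC sethi loads bits 31..10 of a register; or fills in bits 9..0. */
uint32_t hi22(uint32_t addr) { return addr >> 10; }
uint32_t lo10(uint32_t addr) { return addr & 0x3ffu; }

/* What the destination register holds after sethi + or. */
uint32_t rebuild(uint32_t addr)
{
    return (hi22(addr) << 10) | lo10(addr);
}
```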

Page 23

Net Improvement

Final template for push:

  sethi o1, hi(obj_addr)
  or    o1, lo(obj_addr)
  st    o1, [l6]
  ld    [o1], o0
  inc   o0
  st    o0, [o1]

• Interpreter: 11 instructions + 8 dispatch

• Catenated: 6 instructions + 0 dispatch

• push is shorter than most, but very common

Page 24

Evaluation

• Performance

• Ideas

Page 25

Compilation Time

• Templates are fixed size, fast

• Two catenation passes:
  – compute total length
  – memcpy, apply patches (very fast)

• adds 30 - 100% to bytecode compile time
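The two passes above can be sketched with byte strings standing in for the real native-code templates (which the paper extracts from the interpreter); struct and function names are invented:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct template_ { const char *code; size_t len; };

/* Pass 1 sums template lengths; pass 2 memcpys each opcode's template
 * into place.  In the real system, patches are applied to each copy
 * during pass 2. */
char *catenate(const struct template_ *t, const int *prog, size_t n,
               size_t *out_len)
{
    size_t total = 0, i;
    for (i = 0; i < n; i++)            /* pass 1: compute total length */
        total += t[prog[i]].len;

    char *buf = malloc(total);
    char *p = buf;
    for (i = 0; i < n; i++) {          /* pass 2: copy templates */
        memcpy(p, t[prog[i]].code, t[prog[i]].len);
        p += t[prog[i]].len;
    }
    *out_len = total;
    return buf;
}
```

Fixed-size templates are what make both passes cheap: no instruction selection, just arithmetic and copying.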

Page 26

Execution Time

Page 27

Varying I-Cache Size

• Four hypothetical I-cache sizes

• Simics Full Machine Simulator

• 520 Tcl Benchmarks

• Run both interpreted and catenating VM

Page 28

Varying I-Cache Size

[Chart: number of benchmarks improved vs. I-cache size (y-axis 0–100%): 32 KB: 45, 128 KB: 66, 1024 KB: 82, infinite: 54]

Page 29

Branch Prediction - Catenation

• No dispatch branches

• Virtual branches become native

• Similar CFG for native and virtual program

• BTB knows what to do

• Prediction rate similar to statically compiled code: excellent for many programs

Page 30

Implementation Retrospective

• Getting templates from interpreter is fun

• Too brittle for portability, research

• Need explicit control over code gen

• Write own code generator, or

• Make compiler scriptable?

Page 31

Related Work

• Ertl & Gregg, PLDI 2003:
  – efficient interpreters (Forth, OCaml)
  – smaller bytecodes, more dispatch overhead
  – code growth, but little I-cache overflow

• DyC: m88ksim

• Qemu x86 simulator (F. Bellard)

• Many others; see paper

Page 32

Conclusions

• Many ways to speed up interpreters

• Catenation is a good idea but, like all inlining, needs selective application

• Not very applicable to Tcl’s large bytecodes

• Ever-changing micro-architectural issues

Page 33

Future Work

• Investigate correlation between opcode body size, I-cache misses

• Selective outlining, other adaptation

• Port: another architecture; an efficient VM

• Study benefit of each optimization separately

• Type inference

Page 34

The End

Page 35

JIT emitting

• Interpret patches

• A few loads, shifts, adds, and stores

[Figure: bytecode instruction + template with patches → emitter → specialized native code]

Page 36

Token Threading in GNU C

#define NEXT goto *(instr_table [*vpc++])

enum {INST_ADD, INST_PUSH, …};
char prog [] = {INST_PUSH, 2, INST_PUSH, 3, INST_MUL, …};
void *instr_table [] = {&&INST_ADD, &&INST_PUSH, …};

INST_PUSH:
    /* … implementation of PUSH … */
    NEXT;

INST_ADD:
    /* … implementation of ADD … */
    NEXT;

Page 37

Virtual Machines are Everywhere

• Perl, Tcl, Java, Smalltalk. grep?

• Why so popular?

• Software layering strategy

• Portability, Deploy-ability, Manageability

• Very late binding

• Security (e.g., sandbox)

Page 38

Software layering strategy

• Software getting more complex

• Use expressive higher level languages

• Raise level of abstraction

[Figure: layering. Without a VM: App / Runtime Sys / OS / Machine. With a VM: App layer 2 / App layer 1 / App runtime (VM) / Runtime Sys / OS / Machine.]

Page 39

Problem: Performance

• Interpreters are slow: 10 – 1000 times slower than native code

• One possible solution: JITs

Page 40

Just-In-Time Compilation

• Compile to native inside VM, at runtime

• But JITs are complex and non-portable – a JIT would be the most complex and least portable part of, e.g., Tcl

• Many JIT VMs interpret sometimes

Page 41

Reducing Dispatch Count

• In addition to reducing cost of each dispatch, we can reduce the number of dispatches

• Superinstructions: static, or dynamic, e.g.:

• Selective Inlining

Piumarta & Riccardi, PLDI’98

Page 42

Switch Dispatch Assembly

for_loop:
    ldub  [i0], o0              fetch opcode
switch:
    cmp   o0, 19                bounds check
    bgu   for_loop
    add   i0, 1, i0             increment vpc
    sethi hi(inst_tab), r0      look up address
    or    r0, lo(inst_tab), r0
    sll   o0, 2, o0
    ld    [r0 + o0], o2
    jmp   o2                    dispatch
    nop

Page 43

Push Opcode Implementation

add  l6, 4, l6       ; increment VM stack pointer
add  l5, 1, l5       ; increment vpc past opcode; now at operand
ldub [l5], o0        ; load operand from bytecode stream
ld   [fp + 48], o2   ; get bytecode object addr from C stack
ld   [o2 + 4c], o1   ; get literal tbl addr from bytecode obj
sll  o0, 2, o0       ; compute array offset into literal table
ld   [o1 + o0], o1   ; load from literal table
st   o1, [l6]        ; store to top of VM stack
ld   [o1], o0        ; next 3 instructions increment ref count
inc  o0
st   o0, [o1]

• 11 instructions

Page 44

Indirect (Token) Threading

Bytecode (tokens):  0 11, 0 7, 4, 5, 6
  = push 11; push 7; mul; print; done

Instruction table:
  0  push   0x80001000
  1  pop    0x80001020
  2  add    0x80001034
  3  sub    0x80001058
  4  mul    0x80001072
  5  print  0x8000111c
  6  done   0x80001024

Interpreter code ("asm"):
  inst_push:  … next
  inst_pop:   … next
  inst_add:   … next
  …

Page 45

Token Threading Example

#define TclGetUInt1AtPtr(p) ((unsigned int) *(p))
#define Tcl_IncrRefCount(objPtr) ++(objPtr)->refCount
#define NEXT goto *jumpTable [*pc];

case INST_PUSH: {
    Tcl_Obj *objectPtr;

    objectPtr = codePtr->objArrayPtr [TclGetUInt1AtPtr (pc + 1)];
    *++tosPtr = objectPtr;      /* top of stack */
    Tcl_IncrRefCount (objectPtr);
    pc += 2;
    NEXT;
}

Page 46

Token Threaded Dispatch

• 8 instructions

• 14 cycles

  sethi hi(800), o0
  or    o0, 2f0, o0
  ld    [l7 + o0], o1
  ldub  [l5], o0
  sll   o0, 2, o0
  ld    [o1 + o0], o0
  jmp   o0
  nop

Page 47

Direct Threading

Represent program instructions with the addresses of their interpreter implementations.

Bytecode:
  0x80001000 11    (push 11)
  0x80001000 7     (push 7)
  0x80001072       (mul)
  0x8000111c       (print)
  0x80001024       (done)

Interpreter code:
  inst_push:  … next
  inst_mul:   … next
  inst_print: … next
  inst_add:   … next
  inst_done:  … next

Page 48

Direct Threaded Dispatch

• 4 instructions

• 10 cycles

ld r1 = [vpc]

add vpc = vpc + 4

jmp *r1

nop

Page 49

Direct Threading in GNU C

#define NEXT goto *(*vpc++)

void *prog [] = {&&INST_PUSH, (void *) 2, &&INST_PUSH, (void *) 3, &&INST_MUL, …};

INST_PUSH:
    /* … implementation of PUSH … */
    NEXT;

INST_ADD:
    /* … implementation of ADD … */
    NEXT;

Page 50

Superinstructions

  iload 3
  iload 1
  iload 2
  imul
  iadd
  istore 3
  iinc 2 1
  iload 1
  bipush 10
  if_icmplt 7
  iload 2
  bipush 20
  if_icmplt 12
  iinc 1 1

• Note the repeated opcode sequence: iload n; bipush m; if_icmplt target

• Create a new synthetic opcode iload_bipush_if_icmplt that takes 3 parameters
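Static superinstruction formation can be sketched as a peephole rewrite over the bytecode. The opcode values below are invented, and a real implementation must also relocate any branch targets that cross the shrunk region:

```c
#include <assert.h>
#include <stddef.h>

/* Replace each ILOAD, BIPUSH, IF_ICMPLT triple (one operand byte each)
 * with a single synthetic opcode keeping the same three operands, so
 * the sequence costs one dispatch instead of three. */
enum { ILOAD = 1, BIPUSH = 2, IF_ICMPLT = 3, ILOAD_BIPUSH_IF_ICMPLT = 200 };

size_t fuse(unsigned char *code, size_t n)
{
    size_t r = 0, w = 0;
    while (r < n) {
        if (r + 6 <= n && code[r] == ILOAD && code[r + 2] == BIPUSH
                       && code[r + 4] == IF_ICMPLT) {
            code[w++] = ILOAD_BIPUSH_IF_ICMPLT;
            code[w++] = code[r + 1];   /* iload's local index  */
            code[w++] = code[r + 3];   /* bipush's constant    */
            code[w++] = code[r + 5];   /* branch target        */
            r += 6;
        } else {
            code[w++] = code[r++];
        }
    }
    return w;   /* new program length */
}
```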

Page 51

Copying Native Code

Before (separate bodies):         After (inlined superinstruction):
  inst_push:  ld; st; next          inst_push_mul: ld; st; ld; mul; st; next
  inst_mul:   ld; mul; st; next
  inst_print: …; next

Page 52

inst_add Assembly

inst_table:
    .word inst_add          switch table
    .word inst_push
    .word inst_print
    .word inst_done

inst_add:
    ld   [i1], o1           arg = *stack_top--
    add  i1, -4, i1
    ld   [i1], o0           *stack_top += arg
    add  o0, o1, o0
    st   o0, [i1]
    b    for_loop           dispatch

Page 53

Copying Native Code

uint push_len = &&inst_push_end - &&inst_push_start;

uint mul_len = &&inst_mul_end - &&inst_mul_start;

void *codebuf = malloc (push_len + mul_len + 4);

mmap (codebuf, MAP_EXEC);

memcpy (codebuf, &&inst_push_start, push_len);

memcpy (codebuf + push_len, &&inst_mul_start, mul_len);

/* … memcpy (dispatch code) */

Page 54

Limitations of Selective Inlining

• Code is not meant to be memcpy’d

• Can’t move function calls, some branches

• Can’t jump into middle of superinstruction

• Can’t jump out of middle (actually you can)

• Thus, only usable at virtual basic block boundaries

• Some dispatch remains

Page 55

Catenation

• Essentially a template compiler

• Extract templates from interpreter

Page 56

Catenation - Branches

L1: inc_var 1

push 2

cmp

beq L1

L1: code for inc_var

code for push

code for cmp

code for beq-test

beq L1

bytecode native code

•Virtual branches become native branches

•Emit synthesized code; don’t memcpy

Page 57

Operand Fetch

• In interpreter, one generic copy of push for all virtual instructions, with any operands

• Java, Smalltalk, etc. have push_1, push_2

• But, only 256 bytecodes

• Catenated code has separate copy of push for each instruction

[Sample bytecode: push 1; push 2; inc_var 1]

Page 58

Threaded Code, Decentralized Dispatch

• Eliminate the bounds check by avoiding switch

• Make dispatch explicit

• Eliminate the extra branch by not using for

• James Bell, CACM 1973; Charles Moore, Forth, 1970

• Give each instruction its own copy of dispatch

Page 59

Why Real Work Cycles Decrease

• We do not separately show improvements from branch conversion, vpc elimination, and operand specialization

Page 60

Why I-cache Improves

• Useful bodies packed tightly in instruction memory (in interpreter, unused bodies pollute I-cache)

Page 61

Operand Specialization

• push not typical; most instructions much longer (for Tcl)

• But, push is very common

Page 62

Micro Architectural Issues

• Operand fetch includes 1 - 3 loads

• Dispatch includes 1 load, 1 indirect jump

• Branch prediction

Page 63

Branch Prediction

[Figure: control-flow graph for switch dispatch: every body (push, pop, add, sub, beq, …) branches back through the single for/switch dispatch, so the BTB records only the last target of that one indirect branch]

• 85 - 100% mispredictions [Ertl 2003]

Page 64

Better Branch Prediction

[Figure: CFG for threaded code: each body (add, pop, beq, push, sub) ends in its own indirect branch B1…B10, so the BTB holds a last successor per body]

• Approx. 60% mispredict

Page 65

How Catenation Works

• Scan templates for patterns at VM startup:
  – operand specialization points
  – vpc rematerialization points
  – pc-relative instruction fixups

• Cache results in a “compiled” form

• Adds 4 ms to startup time
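The startup scan can be sketched as a search for the magic operand constant in each compiled template. On SPARC the constant actually ends up split across sethi/or immediates, so the real scanner matches instruction patterns rather than raw bytes; this byte-level version (invented name find_magic) just illustrates the idea:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find the byte offset of the magic constant (0x7bc5c5c1 in the
 * deck's macros) inside a template, so a patch can be recorded there. */
ptrdiff_t find_magic(const unsigned char *tmpl, size_t len, uint32_t magic)
{
    for (size_t i = 0; i + sizeof magic <= len; i++) {
        uint32_t v;
        memcpy(&v, tmpl + i, sizeof v);   /* unaligned-safe load */
        if (v == magic)
            return (ptrdiff_t) i;         /* specialization point */
    }
    return -1;                            /* not found */
}
```

Doing this once at VM startup and caching the offsets is what keeps per-compilation work down to memcpy plus patching.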

Page 66

From Interpreter to Templates

• Programming effort:
  – decompose interpreter into one instruction case per file
  – replace operand fetch code with magic numbers

Page 67

From Interpreter to Templates 2

• Software build time (make):
  – compile C to assembly (PIC)
  – selectively de-optimize assembly
  – conventional link

Page 68

Magic Numbers

#ifdef INTERPRET
#define MAGIC_OP1_U1_LITERAL codePtr->objArray [GetUInt1AtPtr (pc + 1)]
#define PC_OP(x) pc ## x
#define NEXT_INSTR break
#elif defined (COMPILE)
#define MAGIC_OP1_U1_LITERAL (Tcl_Obj *) 0x7bc5c5c1
#define NEXT_INSTR goto *jump_range_table [*pc].start
#define PC_OP(x) /* unnecessary */
#endif

case INST_PUSH1: {
    Tcl_Obj *objectPtr;

    objectPtr = MAGIC_OP1_U1_LITERAL;
    *++tosPtr = objectPtr;      /* top of stack */
    Tcl_IncrRefCount (objectPtr);
    PC_OP (+= 2);
    NEXT_INSTR;                 /* dispatch */
}

Page 69

Dynamic Execution Frequency

[Chart: dynamic execution frequency per opcode (0–15%), most frequent first: loadScalar1, push1, pop, loadScalar4, jumpFalse1, jump1, push4, storeScalar1, foreach_step4, dup, add, invokeStk1, jumpFalse4, jump4, storeScalar4, beginCatch4, done, incrScalar1Imm, bitor, bitand, bitxor, sub, invokeStk4, loadArrayStk, rshift, eq, lt, tryCvtToNumeric, loadScalarStk, listindex, lshift, concat1, incrScalar1, lappendScalar1, appendScalar1, loadArray1, lappendScalar4, mult, incrScalarStkImm, callBuiltinFunc1]

Page 70

[Chart: instruction body length in native instructions (0–400) per opcode, ordered by decreasing dynamic frequency (leftmost = dynamically most frequent): loadScalar1, push1, pop, loadScalar4, jumpFalse1, jump1, push4, storeScalar1, foreach_step4, dup, add, invokeStk1, jumpFalse4, jump4, storeScalar4, beginCatch4, done, incrScalar1Imm, bitor, bitand, bitxor, sub, invokeStk4, loadArrayStk, rshift, eq, lt, tryCvtToNumeric, loadScalarStk, listindex, lshift, concat1, incrScalar1, lappendScalar1, appendScalar1, loadArray1, lappendScalar4, mult, incrScalarStkImm, callBuiltinFunc1, callFunc1, not, gt, appendScalar4, listlength, loadArray4, endCatch, div, bitnot, storeArray1]