Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters
Marc Berndl, Benjamin Vitale, Mathew Zaleski, Angela Demke Brown
Research supported by IBM CAS, NSERC, CITO
Context Threading 2
Interpreter performance
• Why not just-in-time (JIT) compile?
• High-performance JVMs still interpret
• People use interpreted languages that don’t yet have JITs
• They still want performance!
• 30-40% of execution time is due to stalls caused by branch misprediction.
• Our technique eliminates 95% of branch mispredictions
Overview
✔ Motivation
• Background: The Context Problem
• Existing Solutions
• Our Approach
• Inlining
• Results
A Tale of Two Machines
[Figure: two machines side by side. Virtual machine interpreter: the virtual program is loaded into a loaded program, and an execution cycle runs the bytecode bodies. Real machine CPU: an execution cycle runs through the pipeline, supported by predictors for return addresses, conditional branches (wayness), and indirect branch target addresses.]
Interpreter
[Figure: the loaded program (internal representation) drives an execution cycle of fetch, dispatch, load parameters, and execute over the bytecode bodies.]
Running Java Example
Java source:
void foo(){
  int i=1;
  do{ i+=i; } while(i<64);
}
Java bytecode (produced by the javac compiler):
0: iconst_1
1: istore_1
2: iload_1
3: iload_1
4: iadd
5: istore_1
6: iload_1
7: bipush 64
9: if_icmplt 2
12: return
Switched Interpreter
while(1){
  opcode = *vPC++;
  switch(opcode){
    case iload_1: ..
      break;
    case iadd: ..
      break;
    //and many more..
  }
};
Slow: burdened by switch and loop overhead.
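The switched loop above can be made concrete for the running example. The following is a minimal sketch, not the interpreter measured in the talk: the opcode numbering, the one-slot absolute branch operand, and the names `foo_code` and `run_switched` are all assumptions made for illustration, and `i` starts at 1 as in the Java source.

```c
#include <stdint.h>

/* Opcodes for the few JVM instructions foo() needs.
   The numeric values are arbitrary, not real JVM opcode numbers. */
enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN };

/* Bytecode for foo(). Array indices match the bytecode offsets of the
   running example. Assumption: the branch operand is stored as an
   absolute target offset in one slot, followed by one padding slot. */
static const int8_t foo_code[] = {
    ICONST_1, ISTORE_1,
    ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1,   /* loop body, offset 2 */
    BIPUSH, 64,
    IF_ICMPLT, 2, 0,
    RETURN
};

/* Runs the bytecode and returns local variable 1 (i) at `return`,
   so that a caller can observe the result. */
int run_switched(const int8_t *code) {
    int stack[8], *sp = stack;    /* operand stack */
    int local1 = 0;               /* local variable slot 1 */
    const int8_t *vPC = code;     /* virtual program counter */

    while (1) {
        int opcode = *vPC++;      /* fetch */
        switch (opcode) {         /* dispatch: one shared indirect branch */
        case ICONST_1: *sp++ = 1;               break;
        case ISTORE_1: local1 = *--sp;          break;
        case ILOAD_1:  *sp++ = local1;          break;
        case IADD:     sp[-2] += sp[-1]; sp--;  break;
        case BIPUSH:   *sp++ = *vPC++;          break;
        case IF_ICMPLT: {
            int b = *--sp, a = *--sp;
            int target = *vPC;    /* absolute offset (sketch assumption) */
            vPC = (a < b) ? code + target : vPC + 2;
            break;
        }
        case RETURN:   return local1;
        default:       return -1; /* unknown opcode */
        }
    }
}
```

Calling `run_switched(foo_code)` doubles `i` from 1 until it reaches 64 and returns 64. Every virtual instruction pays for the loop back-edge and the switch before its body runs, which is the overhead the slide refers to.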
“Threading” Dispatch
iload_1: ..
  goto *vPC++;
iadd: ..
  goto *vPC++;
istore: ..
  goto *vPC++;
Execution of the virtual program “threads” through the bodies (as in needle & thread).
‣ No switch overhead. Data-driven indirect branch.
Context Problem
‣ Data-driven indirect branches are hard to predict.
[Figure: the bodies iload_1, iadd, and istore, each ending in goto *vPC++, all rely on the hardware (micro-architectural) indirect branch predictor.]
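To see why these branches are hard to predict, it can help to count mispredictions in a toy model. The sketch below assumes a predictor that simply remembers the last target seen at each indirect branch site; real predictors are more elaborate, so the counts are illustrative only. It replays the loop body of the running example (bytecode offsets 2 through 9) under three assumptions about what identifies a branch site. The names `site_model` and `count_mispredictions` are hypothetical.

```c
enum { ILOAD_1, IADD, ISTORE_1, BIPUSH, IF_ICMPLT };

/* Opcodes executed on one trip around the do/while loop of foo(). */
static const int loop_ops[] = {
    ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1, BIPUSH, IF_ICMPLT
};
enum { LOOP_LEN = sizeof loop_ops / sizeof loop_ops[0] };

typedef enum {
    SHARED_BRANCH,        /* switch dispatch: one branch for everything  */
    BRANCH_PER_BODY,      /* direct threading: one branch per body       */
    BRANCH_PER_LOCATION   /* idealised: one site per virtual location    */
} site_model;

/* Counts mispredictions of a last-target predictor over `trips` trips
   around the loop. Loop entry and exit are ignored. */
long count_mispredictions(site_model model, long trips) {
    int last_target[LOOP_LEN];    /* large enough for every site id used */
    long misses = 0;

    for (int s = 0; s < LOOP_LEN; s++)
        last_target[s] = -1;      /* -1: no history yet */

    for (long n = 0; n < trips * LOOP_LEN; n++) {
        int pos = (int)(n % LOOP_LEN);
        int target = loop_ops[(pos + 1) % LOOP_LEN]; /* body jumped to */
        int site;

        switch (model) {
        case SHARED_BRANCH:   site = 0;             break;
        case BRANCH_PER_BODY: site = loop_ops[pos]; break;
        default:              site = pos;           break;
        }
        if (last_target[site] != target)
            misses++;             /* prediction differs from real target */
        last_target[site] = target;
    }
    return misses;
}
```

For 1000 trips (7000 dispatches) this model reports 6001 mispredictions with one shared branch and 3004 with one branch per body, because the branch that ends `iload_1` alternates between three different successors. Keying the prediction by virtual program location leaves only the 7 cold misses, which is the kind of context the hardware lacks when dispatch is a data-driven indirect branch.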
Direct Threaded Interpreter
Virtual program: … iload_1, iload_1, iadd, istore_1, iload_1, bipush 64, if_icmplt 2 …
DTT (Direct Threading Table): … &&iload_1, &&iload_1, &&iadd, &&istore_1, &&iload_1, &&bipush, 64, &&if_icmplt, -7 …
vPC points into the DTT.
C implementation of each body:
iload_1: ..
  goto *vPC++;
iadd: ..
  goto *vPC++;
istore: ..
  goto *vPC++;
Target of computed goto is data-driven.
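A direct-threaded version of the running example might look like the sketch below. It relies on the GNU C "labels as values" extension (`&&label` and `goto *`), so it needs GCC or Clang. The DTT is written out by hand here, whereas an interpreter would build it when the method is loaded. The function name `run_direct_threaded`, the `NEXT` macro, and the convention that the branch offset is relative to the slot holding `&&if_icmplt` are assumptions of this sketch; `i` starts at 1 as in the Java source.

```c
#include <stdint.h>

/* Dispatch: load the next body address from the DTT and jump to it. */
#define NEXT goto **vPC++

int run_direct_threaded(void) {
    /* DTT: body addresses interleaved with inline operands. */
    static void *dtt[] = {
        &&iconst_1, &&istore_1,
        &&iload_1, &&iload_1, &&iadd, &&istore_1, &&iload_1, /* slot 2 */
        &&bipush, (void *)(intptr_t)64,
        &&if_icmplt, (void *)(intptr_t)-7,   /* slot 9, back to slot 2 */
        &&vreturn
    };
    void **vPC = dtt;             /* virtual program counter into the DTT */
    int stack[8], *sp = stack;    /* operand stack */
    int local1 = 0;               /* local variable slot 1 (i) */

    NEXT;                         /* dispatch the first instruction */

iconst_1:  *sp++ = 1;                        NEXT;
istore_1:  local1 = *--sp;                   NEXT;
iload_1:   *sp++ = local1;                   NEXT;
iadd:      sp[-2] += sp[-1]; sp--;           NEXT;
bipush:    *sp++ = (int)(intptr_t)*vPC++;    NEXT;
if_icmplt: {
        int b = *--sp, a = *--sp;
        intptr_t offset = (intptr_t)*vPC;    /* operand slot */
        /* vPC - 1 is the slot of if_icmplt itself. */
        vPC = (a < b) ? (vPC - 1) + offset : vPC + 1;
        NEXT;
    }
vreturn:   return local1;
}
```

`run_direct_threaded()` returns 64. Each body ends in its own indirect branch, and where that branch goes depends only on the data in the DTT, not on anything the hardware program counter reveals.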
Existing Solutions
• Piumarta & Riccardi: bodies replicated
• Ertl & Gregg: bodies and dispatch replicated
[Figure: a row of bodies sharing one "GOTO *PC" dispatch whose target is unknown (????), contrasted with two replicated copies of iload_1 (copies 1 and 2), each ending in its own goto *pc. Labels: Super Instruction, Replicate.]
Limited to relocatable virtual instructions
Overview
✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
• Our Approach
• Inlining
• Results
Key Observation
• Virtual and native control flow are similar:
• Linear or straight-line code
• Conditional branches
• Calls and returns
• Indirect branches
• Hardware has predictors for each type
• Direct threading uses an indirect branch for everything!
‣ Solution: Leverage hardware predictors
Essence of our Solution
Package bodies as subroutines and call them.
Virtual program: … iload_1, iload_1, iadd, istore_1, iload_1, bipush 64, if_icmplt 2 …
CTT (Context Threading Table, generated code):
  ..
  call iload_1
  call iload_1
  call iadd
  call istore_1
  call iload_1
Bytecode bodies (ret terminated):
iload_1: ..
  ret;
iadd: ..
  ret;
Returns are predicted by the return branch predictor stack.
Subroutine Threading
Virtual program: … iload_1, iload_1, iadd, istore_1, iload_1, bipush 64, if_icmplt 2 …
CTT (code generated at load time):
  call iload_1
  call iload_1
  call iadd
  call istore_1
  call iload_1
  call bipush
  call if_icmplt
Bytecode bodies (ret terminated):
iload_1: …
  ret;
iadd: …
  ret;
Virtual branch instructions as before:
if_icmplt: …
  goto *vPC++;
The DTT contains addresses in the CTT, alongside the operands 64 and -7; vPC points into the DTT.
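The structure of subroutine threading can be imitated in C, with caveats. In the sketch below the CTT is written by hand as a straight-line sequence of direct calls, whereas the slides generate native call instructions at load time. The bodies are ordinary C functions, and the DTT holds only the two CTT addresses that the branch can reach. The `if_icmplt` body is written in place because a C function cannot jump into its caller. All names (`run_subroutine_threaded`, `ctt_2`, `ctt_12`) are hypothetical, operands are passed as arguments for brevity, and the code relies on GNU C labels as values (GCC or Clang).

```c
/* Interpreter state shared by the bodies (globals keep the sketch short). */
static int stack[8], *sp;
static int local1;

/* Bytecode bodies packaged as subroutines: each ends in a native return. */
static void iconst_1(void) { *sp++ = 1; }
static void istore_1(void) { local1 = *--sp; }
static void iload_1(void)  { *sp++ = local1; }
static void iadd(void)     { sp[-2] += sp[-1]; sp--; }
static void bipush(int v)  { *sp++ = v; }

int run_subroutine_threaded(void) {
    /* DTT: maps a bytecode offset to the matching position in the CTT.
       Only the two possible branch destinations are filled in. */
    static void *dtt[13] = { [2] = &&ctt_2, [12] = &&ctt_12 };

    sp = stack;
    local1 = 0;

    /* CTT: one direct call per straight-line virtual instruction. */
    iconst_1();
    istore_1();
ctt_2:
    iload_1();
    iload_1();
    iadd();
    istore_1();
    iload_1();
    bipush(64);
    {   /* if_icmplt: still dispatched through an indirect branch */
        int b = *--sp, a = *--sp;
        int target = (a < b) ? 2 : 12;
        goto *dtt[target];
    }
ctt_12:
    return local1;
}
```

`run_subroutine_threaded()` returns 64. The straight-line part now consists of direct calls and returns, which match the hardware's call and return prediction; only the virtual branch still goes through a data-driven indirect jump. An optimising compiler may inline the small functions, so this shows the shape of the technique rather than its performance.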
The Context Threading Table
• A sequence of generated call instructions
• Good alignment of virtual and hardware control flow for straight-line code.
‣ Can virtual branches go into the CTT?
Specialized Branch Inlining
Branch inlined into the CTT:
  …
target:
  …
  call …
  call iload_1
  if(icmplt) goto target
  …
Conditional branch predictor now mobilized.
Inlining conditional branches provides context.
[Figure labels: DTT, vPC, 5.]
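A hand-written C equivalent of a CTT with the branch inlined could look like the following sketch. It is plain standard C. The helper `icmplt`, which performs the comparison part of `if_icmplt`, and the function name `run_branch_inlined` are assumptions for illustration, and operands are again passed as arguments rather than read through vPC.

```c
/* Interpreter state shared by the bodies (globals keep the sketch short). */
static int stack[8], *sp;
static int local1;

/* Straight-line bodies, still called as subroutines. */
static void iconst_1(void) { *sp++ = 1; }
static void istore_1(void) { local1 = *--sp; }
static void iload_1(void)  { *sp++ = local1; }
static void iadd(void)     { sp[-2] += sp[-1]; sp--; }
static void bipush(int v)  { *sp++ = v; }

/* Comparison part of if_icmplt: pops two ints and reports whether
   the virtual branch is taken. */
static int icmplt(void) {
    int b = *--sp, a = *--sp;
    return a < b;
}

int run_branch_inlined(void) {
    sp = stack;
    local1 = 0;

    iconst_1();
    istore_1();
target:                       /* CTT position of bytecode offset 2 */
    iload_1();
    iload_1();
    iadd();
    istore_1();
    iload_1();
    bipush(64);
    if (icmplt())             /* virtual branch becomes a native */
        goto target;          /* conditional branch in the CTT   */
    return local1;
}
```

`run_branch_inlined()` returns 64. The loop-closing virtual branch is now a native conditional branch with a fixed address and a fixed target, so the conditional branch predictor sees one branch per virtual branch.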
Tiny Inlining
• Context Threading is a dispatch technique
• But, we inline branches
• Some non-branching bodies are very small
• Why not inline those?
Inline all tiny linear bodies into the CTT
Overview
✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
✔ Our Approach
✔ Inlining
• Results
Experimental Setup
• Two virtual machines on two hardware architectures
• VM: Java/SableVM, OCaml interpreter
• Compare against direct threaded SableVM
• SableVM distro uses selective inlining
• Arch: P4, PPC
• Branch Misprediction
• Execution Time
Is our technique effective and general?
Mispredicted Taken Branches
[Figure: bar chart of mispredicted taken branches, normalized to direct threading (axis 0 to 1.00), for the benchmarks compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, and soot. Series: Subroutine, Branch Inlining, Tiny Inlining. SableVM/Java, Pentium 4.]
95% mispredictions eliminated on average
Execution time
[Figure: bar chart of execution time, normalized to direct threading (axis 0 to 1.00), for the benchmarks compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, and soot. Series: Subroutine, Branch Inlining, Tiny Inlining. Pentium 4.]
27% average reduction in execution time
Execution Time (geomean)
[Figure: bar chart of geometric-mean execution time, normalized to direct threading (axis 0 to 1.00), for java/p4, java/ppc, ocaml/p4, and ocaml/ppc. Series: Subroutine, Branch Inlining, Tiny Inlining.]
Our technique is effective and general
Conclusions
• Context Problem: branch mispredictions due to mismatch between native and virtual control flow
• Solution: Generate control flow code into the Context Threading Table
• Results:
• Eliminate 95% of branch mispredictions
• Reduce execution time by 30-40%
‣ Recent, post-CGO 2005, work follows
What about Scripting Languages?
• Recently ported context threading to Tcl.
• 10x cycles executed per bytecode dispatched.
• Much lower dispatch overhead.
• Speedup due to subroutine threading, approx. 5%.
• Tcl conference 2005
[Figure: cycles per dispatch (cycles per virtual instruction) on a logarithmic axis from 10^0 to 10^5, for Tcl and OCaml benchmarks.]