Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Marc Berndl, Benjamin Vitale, Mathew Zaleski, Angela Demke Brown

Research supported by IBM CAS, NSERC, CITO



Interpreter performance

• Why not just-in-time (JIT) compile?
  • High-performance JVMs still interpret
  • People use interpreted languages that don’t yet have JITs
  • They still want performance!

• 30-40% of execution time is due to stalls caused by branch misprediction.

• Our technique eliminates 95% of branch mispredictions

Overview

✔ Motivation
• Background: The Context Problem
• Existing Solutions
• Our Approach
• Inlining
• Results

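A Tale of Two Machines

[Diagram: two machines side by side. The virtual machine interpreter loads the virtual program into a loaded program and runs its execution cycle over the bytecode bodies. The real machine CPU runs its own execution cycle through the pipeline, aided by predictors for return addresses, for the wayness of conditional branches, and for the target addresses of indirect branches.]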

Interpreter

[Diagram: the loaded program (internal representation) and the bytecode bodies feed the execution cycle: fetch, dispatch, load parms, execute.]

Running Java Example

Java source:

    void foo() {
        int i = 1;
        do {
            i += i;
        } while (i < 64);
    }

Java bytecode (produced by the javac compiler):

     0: iconst_1
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return

Switched Interpreter

    while (1) {
        opcode = *vPC++;
        switch (opcode) {
        case iload_1:
            ..
            break;
        case iadd:
            ..
            break;
        // and many more..
        }
    }

Slow: burdened by switch and loop overhead.
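As a rough, runnable illustration of switch dispatch, the sketch below interprets the running example in C. The opcode numbering, the one-byte absolute branch operand, and the name `run_switch` are assumptions made for this sketch; they are not the JVM's or SableVM's actual encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch; real JVM opcode values differ. */
enum { OP_ICONST_1, OP_ISTORE_1, OP_ILOAD_1, OP_IADD,
       OP_BIPUSH, OP_IF_ICMPLT, OP_RETURN };

/* Bytecode of foo(), matching "int i = 1" in the Java source. Operands are
   stored inline. Assumption: the branch operand is a one-byte absolute index
   into this array (the JVM uses a two-byte relative offset). */
static const int8_t foo_code[] = {
    OP_ICONST_1,       /*  0 */
    OP_ISTORE_1,       /*  1 */
    OP_ILOAD_1,        /*  2 */
    OP_ILOAD_1,        /*  3 */
    OP_IADD,           /*  4 */
    OP_ISTORE_1,       /*  5 */
    OP_ILOAD_1,        /*  6 */
    OP_BIPUSH, 64,     /*  7 */
    OP_IF_ICMPLT, 2,   /*  9 */
    OP_RETURN          /* 11 */
};

/* Runs foo() and returns the final value of local variable 1. */
int run_switch(void) {
    const int8_t *vPC = foo_code;
    int stack[8], *sp = stack;   /* sp points at the next free slot */
    int locals[2] = {0, 0};

    while (1) {
        int opcode = *vPC++;     /* fetch */
        switch (opcode) {        /* dispatch: one shared indirect branch */
        case OP_ICONST_1: *sp++ = 1;                break;
        case OP_ISTORE_1: locals[1] = *--sp;        break;
        case OP_ILOAD_1:  *sp++ = locals[1];        break;
        case OP_IADD:     sp[-2] += sp[-1]; sp--;   break;
        case OP_BIPUSH:   *sp++ = *vPC++;           break;
        case OP_IF_ICMPLT: {
            int target = *vPC++;
            int rhs = *--sp;
            int lhs = *--sp;
            if (lhs < rhs)
                vPC = foo_code + target;
            break;
        }
        case OP_RETURN:
            assert(sp == stack); /* operand stack is empty on return */
            return locals[1];
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}
```

Calling `run_switch()` doubles `i` from 1 until it reaches 64. Every virtual instruction pays for the loop back-edge, the opcode fetch, and the switch's table lookup before its body runs.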

“Threading” Dispatch

Execution of the virtual program “threads” through the bodies (as in needle & thread).

    iload_1: ..
        goto *vPC++;
    iadd: ..
        goto *vPC++;
    istore: ..
        goto *vPC++;

Virtual program:

    0: iconst_1  1: istore_1  2: iload_1  3: iload_1  4: iadd  5: istore_1  6: iload_1  7: bipush 64  9: if_icmplt 2  12: return

‣ No switch overhead. Data-driven indirect branch.

Context Problem

Virtual program:

    0: iconst_1  1: istore_1  2: iload_1  3: iload_1  4: iadd  5: istore_1  6: iload_1  7: bipush 64  9: if_icmplt 2  12: return

Bodies, each ending in its own indirect branch:

    iload_1: ..
        goto *vPC++;
    iadd: ..
        goto *vPC++;
    istore: ..
        goto *vPC++;

[Diagram: every dispatch branch goes through the indirect branch predictor of the micro-architecture.]

‣ Data-driven indirect branches are hard to predict
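The difficulty can be made concrete with a toy model. The sketch below replays the bodies that foo() executes and counts dispatch mispredictions, assuming the hardware keeps a single last-taken-target entry per indirect branch (a simplified branch target buffer; real predictors are more elaborate). The names `btb_stats` and `simulate_direct_threading` are hypothetical.

```c
#include <assert.h>

enum body { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN, NBODIES };

struct btb_stats {
    int dispatches;      /* indirect dispatch branches executed */
    int misses;          /* dispatches whose predicted target was wrong */
    int iload_1_misses;  /* misses on the branch that ends the iload_1 body */
};

/* Fills trace[] with the bodies foo() executes, in order; returns the count. */
static int build_trace(enum body *trace) {
    static const enum body loop[] = {
        ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1, BIPUSH, IF_ICMPLT
    };
    int n = 0;
    trace[n++] = ICONST_1;
    trace[n++] = ISTORE_1;
    for (int i = 1; i < 64; i += i)      /* six trips round the do-while */
        for (int k = 0; k < 7; k++)
            trace[n++] = loop[k];
    trace[n++] = RETURN;
    return n;
}

/* Models direct threading: each body ends in its own indirect branch, and the
   assumed predictor guesses the target that branch took last time. */
struct btb_stats simulate_direct_threading(void) {
    enum body trace[64];
    int n = build_trace(trace);
    int btb[NBODIES];
    struct btb_stats s = {0, 0, 0};

    assert(n > 0 && n <= 64);
    for (int b = 0; b < NBODIES; b++)
        btb[b] = -1;                     /* cold: no prediction yet */

    for (int t = 0; t + 1 < n; t++) {
        int from = trace[t], to = trace[t + 1];
        s.dispatches++;
        if (btb[from] != to) {
            s.misses++;
            if (from == ILOAD_1)
                s.iload_1_misses++;
        }
        btb[from] = to;                  /* remember the last target */
    }
    return s;
}
```

Under these assumptions the model reports 24 mispredictions in 44 dispatches, 18 of them on the branch that ends `iload_1`. That body is followed by `iload_1`, `iadd`, and `bipush` in turn, so the last target is never the next one: the branch has no context to tell the three call sites apart.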

Direct Threaded Interpreter

    Virtual program     DTT (Direct Threading Table)
    …
    iload_1             &&iload_1
    iload_1             &&iload_1
    iadd                &&iadd
    istore_1            &&istore_1
    iload_1             &&iload_1
    bipush 64           &&bipush, 64
    if_icmplt 2         &&if_icmplt, -7
    …

vPC points into the DTT.

C implementation of each body:

    iload_1: ..
        goto *vPC++;
    iadd: ..
        goto *vPC++;
    istore: ..
        goto *vPC++;

Target of computed goto is data-driven.
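A minimal sketch of the same scheme as compilable C follows. It relies on the GNU C labels-as-values extension (`&&label`, `goto *expr`), so it assumes GCC or Clang. The DTT is written out by hand here, whereas an interpreter would build it when the method is loaded; the slot layout and the name `run_direct_threaded` are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Runs foo() with direct-threaded dispatch; returns local variable 1. */
int run_direct_threaded(void) {
    /* DTT: body addresses with operands stored inline after them. */
    void *dtt[] = {
        &&iconst_1,                          /*  0 */
        &&istore_1,                          /*  1 */
        &&iload_1,                           /*  2 */
        &&iload_1,                           /*  3 */
        &&iadd,                              /*  4 */
        &&istore_1,                          /*  5 */
        &&iload_1,                           /*  6 */
        &&bipush, (void *)(intptr_t)64,      /*  7 */
        &&if_icmplt, (void *)(intptr_t)-7,   /*  9: offset relative to slot 9 */
        &&vreturn                            /* 11 */
    };
    void **vPC = dtt;
    int stack[8], *sp = stack;   /* sp points at the next free slot */
    int locals[2] = {0, 0};

    goto **vPC++;                /* dispatch the first instruction */

iconst_1:
    *sp++ = 1;
    goto **vPC++;
istore_1:
    locals[1] = *--sp;
    goto **vPC++;
iload_1:
    *sp++ = locals[1];
    goto **vPC++;
iadd:
    sp[-2] += sp[-1];
    sp--;
    goto **vPC++;
bipush:
    *sp++ = (int)(intptr_t)*vPC++;
    goto **vPC++;
if_icmplt: {
    intptr_t offset = (intptr_t)*vPC++;  /* vPC is now two slots past if_icmplt */
    int rhs = *--sp;
    int lhs = *--sp;
    if (lhs < rhs)
        vPC += offset - 2;               /* taken: jump relative to the if_icmplt slot */
    goto **vPC++;
}
vreturn:
    assert(sp == stack);
    return locals[1];
}
```

Each body ends in its own `goto **vPC++`, so there is no loop or switch, but every one of those branches is indirect and its target depends on the data in the DTT.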

Existing Solutions

[Diagram: many bodies share a “GOTO *PC” dispatch whose target is unknown (“????”). Replicating a body, for example two copies of iload_1 (1 and 2), each ending in its own “goto *pc”, gives every copy its own prediction context.]

• Piumarta & Riccardi: bodies replicated (superinstructions)
• Ertl & Gregg: bodies and dispatch replicated
• Limited to relocatable virtual instructions

Overview

✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
• Our Approach
• Inlining
• Results

Key Observation

• Virtual and native control flow are similar
  • Linear or straight-line code
  • Conditional branches
  • Calls and returns
  • Indirect branches
• Hardware has predictors for each type
  • Direct threading uses an indirect branch for everything!

‣ Solution: Leverage hardware predictors

Essence of our Solution

Virtual program:

    … iload_1  iload_1  iadd  istore_1  iload_1  bipush 64  if_icmplt 2 …

CTT (Context Threading Table), generated code:

    ..
    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1

Bytecode bodies (ret terminated):

    iload_1: ..
        ret;
    iadd: ..
        ret;

[Diagram: the hardware return branch predictor stack predicts the target of each ret.]

Package bodies as subroutines and call them
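The idea can be modelled in plain C. In the sketch below, ordinary C functions stand in for the ret-terminated bodies, and a hand-written function stands in for the CTT fragment covering bytecodes 2 to 5 (`i += i`). The real CTT is native code emitted by the interpreter, not C source, and an optimizing C compiler may inline these calls; the name `run_ctt_fragment` is hypothetical.

```c
#include <assert.h>

static int stack[8], *sp = stack;   /* operand stack; sp points at the next free slot */
static int locals[2];

/* Bodies packaged as subroutines; the C function return stands in for ret. */
static void iload_1(void)  { *sp++ = locals[1]; }
static void iadd(void)     { sp[-2] += sp[-1]; sp--; }
static void istore_1(void) { locals[1] = *--sp; }

/* Stand-in for the CTT entries generated for bytecodes 2..5: one direct call
   per virtual instruction, in virtual program order. */
int run_ctt_fragment(int i) {
    sp = stack;
    locals[1] = i;
    iload_1();
    iload_1();
    iadd();
    istore_1();
    assert(sp == stack);            /* the fragment leaves the stack balanced */
    return locals[1];
}
```

For example, `run_ctt_fragment(3)` returns 6. Each call site has its own return address, so the hardware return-address predictor sees three distinct returns from `iload_1` rather than one ambiguous indirect branch.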

Subroutine Threading

CTT (code generated at load time):

    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    call bipush
    call if_icmplt

Bytecode bodies (ret terminated):

    iload_1: …
        ret;
    iadd: …
        ret;

Virtual branch instructions dispatch as before:

    if_icmplt: …
        goto *vPC++;

Virtual program:

    … iload_1  iload_1  iadd  istore_1  iload_1  bipush 64  if_icmplt 2 …

The DTT contains addresses in the CTT, plus the operands (64, -7). vPC points into the DTT.


The Context Threading Table

• A sequence of generated call instructions

• Good alignment of virtual and hardware control flow for straight-line code.

‣ Can virtual branches go into the CTT?

Specialized Branch Inlining

CTT with the branch inlined:

    …
    target:
    …
    call …
    call iload_1
    if (icmplt) goto target;

[Diagram: vPC points into the DTT; the branch inlined into the CTT is a native conditional branch, so the conditional branch predictor is now mobilized.]

Inlining conditional branches provides context
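As a hedged illustration, the whole of foo() can be written out the way a CTT with an inlined branch would be laid out. C functions again stand in for the ret-terminated bodies and a C `if`/`goto` stands in for the generated native conditional branch. Passing the `bipush` operand as an argument is a simplification of this sketch, and `run_branch_inlined` is a hypothetical name.

```c
#include <assert.h>

static int stack[8], *sp = stack;   /* operand stack; sp points at the next free slot */
static int locals[2];

static void iconst_1(void) { *sp++ = 1; }
static void istore_1(void) { locals[1] = *--sp; }
static void iload_1(void)  { *sp++ = locals[1]; }
static void iadd(void)     { sp[-2] += sp[-1]; sp--; }
static void bipush(int v)  { *sp++ = v; }

/* Stand-in for the CTT of foo(): calls for straight-line code, with
   if_icmplt inlined as a native compare-and-branch. */
int run_branch_inlined(void) {
    sp = stack;
    locals[1] = 0;
    iconst_1();
    istore_1();
target:                             /* CTT address of bytecode 2 */
    iload_1();
    iload_1();
    iadd();
    istore_1();
    iload_1();
    bipush(64);
    sp -= 2;                        /* inlined if_icmplt: pop both operands */
    if (sp[0] < sp[1])
        goto target;                /* direct conditional branch, no vPC lookup */
    assert(sp == stack);
    return locals[1];
}
```

The loop's back-edge is now one conditional branch at a fixed address with a fixed target, which is exactly the pattern a conditional branch predictor handles well. `run_branch_inlined()` returns 64.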

Tiny Inlining

• Context Threading is a dispatch technique
  • But, we inline branches
• Some non-branching bodies are very small
  • Why not inline those?

Inline all tiny linear bodies into the CTT
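Continuing the C model, tiny inlining amounts to copying each small body into the generated sequence instead of calling it. The sketch below assumes every body in the running example counts as tiny; the deck does not say where the size threshold lies, and `run_tiny_inlined` is a hypothetical name.

```c
#include <assert.h>

/* Stand-in for a CTT in which every tiny body has been copied in place of
   its call, next to the inlined conditional branch. */
int run_tiny_inlined(void) {
    int stack[8], *sp = stack;      /* sp points at the next free slot */
    int locals[2] = {0, 0};

    *sp++ = 1;                      /* iconst_1 */
    locals[1] = *--sp;              /* istore_1 */
target:
    *sp++ = locals[1];              /* iload_1 */
    *sp++ = locals[1];              /* iload_1 */
    sp[-2] += sp[-1]; sp--;         /* iadd */
    locals[1] = *--sp;              /* istore_1 */
    *sp++ = locals[1];              /* iload_1 */
    *sp++ = 64;                     /* bipush 64 */
    sp -= 2;                        /* if_icmplt, inlined as before */
    if (sp[0] < sp[1])
        goto target;
    assert(sp == stack);
    return locals[1];
}
```

No call or return remains for these bodies, so the dispatch cost for them disappears; the trade-off is a larger CTT, which is why only tiny bodies are candidates.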

Overview

✔ Motivation
✔ Background: The Context Problem
✔ Existing Solutions
✔ Our Approach
✔ Inlining
• Results

Experimental Setup

• Two virtual machines on two hardware architectures
  • VM: Java/SableVM, OCaml interpreter
  • Compare against direct-threaded SableVM
  • SableVM distro uses selective inlining
  • Arch: P4, PPC
• Branch misprediction
• Execution time

Is our technique effective and general?

Mispredicted Taken Branches

[Bar chart, SableVM/Java on Pentium 4: mispredicted taken branches normalized to direct threading (y-axis 0 to 1.00) for subroutine threading, branch inlining, and tiny inlining, on the benchmarks compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, and soot.]

95% of mispredictions eliminated on average

Execution time

[Bar chart, Pentium 4: execution time normalized to direct threading (y-axis 0 to 1.00) for subroutine threading, branch inlining, and tiny inlining, on the benchmarks compress, db, jack, javac, jess, mpeg, mtrt, ray, scimark, and soot.]

27% average reduction in execution time

Execution Time (geomean)

[Bar chart: geometric mean execution time normalized to direct threading (y-axis 0 to 1.00) for subroutine threading, branch inlining, and tiny inlining, on java/p4, java/ppc, ocaml/p4, and ocaml/ppc.]

Our technique is effective and general

Conclusions

• Context Problem: branch mispredictions due to mismatch between native and virtual control flow

• Solution: Generate control flow code into the Context Threading Table

• Results
  • Eliminate 95% of branch mispredictions
  • Reduce execution time by 30-40%

‣ Recent work, post CGO 2005, follows

What about Scripting Languages?

• Recently ported context threading to Tcl
• 10x cycles executed per bytecode dispatched
• Much lower dispatch overhead
• Speedup due to subroutine threading: approx. 5%
• Tcl conference 2005

[Chart: cycles per dispatch (cycles per virtual instruction) on a log scale from 10^0 to 10^5, for Tcl and OCaml benchmarks.]