Post on 14-Dec-2015


Hardware-based Devirtualization (VPC Prediction)

Hyesoon Kim, Jose A. Joao, Onur Mutlu++, Chang Joo Lee, Yale N. Patt, Robert Cohn*

++ *

2

Outline

Background and Motivation

VPC (Virtual Program Counter) Prediction

Results

Conclusion

3

Direct vs. Indirect Branch

[Figure: a conditional (direct) branch "br.cond TARGET" at address A goes either to TARG (taken, T) or to A+1 (not taken, N); an indirect branch "R1 = MEM[R2]; branch R1" at address A has an unknown target (?).]

Indirect branches are costly for processor performance

Much more difficult to predict than conditional (direct) branches: multiple target addresses

An indirect branch predictor requires a large structure

4

Source Code Examples

Switch structures

Virtual function calls

Source code:
Shape *s = …;
a = s->area(); // virtual function call

Static assembly code:
R1 = MEM[R2] // function address lookup
call R1 // a register-indirect call
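The slide lists switch structures as a second source of indirect branches but shows code only for the virtual call. A minimal sketch of the switch case follows; the Opcode enum and the execute function are hypothetical names chosen for illustration. Compilers often, though not always, lower a dense switch like this to a jump-table lookup followed by a register-indirect jump.

```cpp
// Hypothetical interpreter step; the opcode names are illustrative only.
enum Opcode { OP_ADD, OP_SUB, OP_MUL, OP_NEG };

int execute(Opcode op, int a, int b) {
    // A dense switch may be compiled to: R1 = MEM[table + op]; jump R1
    // which is an indirect branch with one target per case.
    switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
        case OP_NEG: return -a;
    }
    return 0;  // not reached for valid opcodes
}
```

Whether a jump table is actually emitted depends on the compiler, the optimization level, and the number and density of the cases.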

5

Indirect Branch Mispredictions

Data from Intel Core Duo processor

[Figure: bar chart of direct and indirect branch mispredictions per kilo-instruction (MPKI), y-axis 0 to 16, for iexplorer, firefox, vtune, cygwin, emacs, acroread, winexplorer, desktop-search, outlook, excel, simics, winamp, avida, windvd, nasa-worldwind, pptview, sqlservr, and AVG.]
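The charts in this talk report MPKI. As a reminder, the standard definition of the metric (not spelled out on the slides) is mispredictions per thousand instructions:

```latex
\mathrm{MPKI} = \frac{\text{number of branch mispredictions}}{\text{number of retired instructions}} \times 1000
```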

6

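[Figure: block diagram of a conventional branch predictor. Labeled components: direction predictor, Branch Target Buffer (BTB), and a separate indirect branch predictor indexed by a hash of the PC (address 0x0800) and the global history register (GHR, ..1001010). Candidate next addresses are PC+1, TARG1, and TARG2; the choice depends on whether the branch is direct or indirect and on the taken (T) prediction. The predicted target shown is TARG2.]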

7

Outline

Background and Motivation

VPC (Virtual Program Counter) Prediction

Results

Conclusion

8

VPC Prediction: Basic Idea

Key idea: Treat an indirect branch as multiple “virtual” conditional branches

Only for prediction purposes

Use the conditional branch predictor

9

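[Figure: block diagram of the VPC branch predictor. Labeled components: direction predictor and Branch Target Buffer, accessed with virtual PCs (VPC1, VPC2) derived from the PC (address 0x0800) together with a hash of the GHR (..1001010). Candidate targets are TARG1 and TARG2; the output is the predicted target.]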

10

VPC Prediction: Basic Idea

Key idea: Treat an indirect branch as multiple “virtual” conditional branches

Only for prediction purposes

Use the conditional branch predictor

Benefits:

No separate complex structure

Can be applied to any other conditional branch prediction algorithm

Improving the conditional branch prediction algorithm will improve the indirect branch prediction accuracy

11

Inspiration: Static Devirtualization

Source code:
Shape *s = …;
a = s->area(); // an indirect call

Optimized source code:
Shape *s = …;
if (s->type == Rectangle) // a conditional branch at PC: X
  a = Rectangle::area();
else if (s->type == Circle) // a conditional branch at PC: Y
  a = Circle::area();
else
  a = s->area(); // an indirect call at PC: Z

Smalltalk (’84), Calder and Grunwald (’94), Garrett et al. (’94), Ishizaki et al. (’00)
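The optimized source on this slide is schematic. A compilable sketch of the same guarded transformation is shown below, assuming a hypothetical hierarchy with an explicit type tag; the ShapeType enum, the Triangle class, the constructors, and area_devirtualized are illustrative names that are not on the slide. A qualified call such as Rectangle::area() bypasses virtual dispatch, so the first two branches make direct calls and only the fallback remains an indirect call.

```cpp
// Hypothetical hierarchy: only Shape, Rectangle, Circle, type, and area() come from the slide.
enum ShapeType { RectangleType, CircleType, TriangleType };

struct Shape {
    ShapeType type;
    explicit Shape(ShapeType t) : type(t) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Rectangle : Shape {
    double w, h;
    Rectangle(double w_, double h_) : Shape(RectangleType), w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct Circle : Shape {
    double r;
    explicit Circle(double r_) : Shape(CircleType), r(r_) {}
    double area() const override { return 3.14159 * r * r; }
};

struct Triangle : Shape {
    double b, h;
    Triangle(double b_, double h_) : Shape(TriangleType), b(b_), h(h_) {}
    double area() const override { return 0.5 * b * h; }
};

double area_devirtualized(const Shape* s) {
    if (s->type == RectangleType)       // conditional branch (PC: X on the slide)
        return static_cast<const Rectangle*>(s)->Rectangle::area();  // direct call
    else if (s->type == CircleType)     // conditional branch (PC: Y on the slide)
        return static_cast<const Circle*>(s)->Circle::area();        // direct call
    else
        return s->area();               // remaining indirect call (PC: Z on the slide)
}
```

The guards are fixed at compile time, which is the lack of adaptivity that VPC prediction is meant to address by making the equivalent decision in the predictor at run time.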

12

VPC Prediction

Source code:
Shape *s = …;
a = s->area(); // an indirect call

Static assembly code:
R1 = MEM[R2]
call R1 // PC: L

Dynamic virtual branches (for prediction purposes):
conditional jump TARGET1 // virtual PC = L
conditional jump TARGET2 // virtual PC = L XOR HASHVAL[1]
conditional jump TARGET3 // virtual PC = L XOR HASHVAL[2]
conditional jump TARGET4 // virtual PC = L XOR HASHVAL[3]

13

Virtual PC Address Generation

Use original PC address and iteration counter value

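[Figure: the iteration counter value indexes a hash value table (example entries 0xabcd, 0x018a, 0x7a9c, 0x…); the selected hash value is combined with the PC to form the virtual PC.]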
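The address generation above can be sketched in a few lines. This is only an illustration: MAX_ITER = 4, the 1-based iter argument, and the mapping of the figure's example values to HASHVAL[1] through HASHVAL[3] are assumptions, and entry 0 is set to zero so that the first virtual branch uses the unmodified PC, as in the "virtual PC = L" line of the previous slide.

```cpp
#include <cstdint>

// Assumed table: entry 0 is zero so the first virtual PC equals the real PC;
// the other three values are the example entries drawn in the figure.
constexpr int MAX_ITER = 4;
const std::uint32_t HASHVAL[MAX_ITER] = {0x0000, 0xabcd, 0x018a, 0x7a9c};

// Virtual PC for a given iteration (iter is 1-based, 1 <= iter <= MAX_ITER).
std::uint32_t virtual_pc(std::uint32_t pc, int iter) {
    return pc ^ HASHVAL[iter - 1];
}
```

For example, with pc = 0x0800 the second virtual branch would be looked up at 0x0800 XOR 0xabcd = 0xa3cd under these assumed table contents.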

14

VPC Prediction Process-I

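call R1 // PC: L (real instruction)

Virtual instructions:
cond. jump TARG1 // VPC: L
cond. jump TARG2 // VPC: VL2
cond. jump TARG3 // VPC: VL3
cond. jump TARG4 // VPC: VL4

[Figure: iteration 1. The direction predictor is accessed with PC = L and GHR = 1111, and the BTB supplies TARG1. The direction predictor outputs not taken, so prediction moves to the next iteration.]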

15

VPC Prediction Process-II

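call R1 // PC: L (real instruction)

Virtual instructions:
cond. jump TARG1 // VPC: L
cond. jump TARG2 // VPC: VL2
cond. jump TARG3 // VPC: VL3
cond. jump TARG4 // VPC: VL4

[Figure: iteration 2. The direction predictor is accessed with VPC = VL2 and VGHR = 1110, and the BTB supplies TARG2. The direction predictor outputs not taken, so prediction moves to the next iteration.]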

16

VPC Prediction Process-III

call R1 // PC: L (real instruction)

Virtual instructions:
cond. jump TARG1 // VPC: L
cond. jump TARG2 // VPC: VL2
cond. jump TARG3 // VPC: VL3
cond. jump TARG4 // VPC: VL4

[Figure: iteration 3. The direction predictor is accessed with VPC = VL3 and VGHR = 1100, and the BTB supplies TARG3. The direction predictor outputs taken.]

Predicted Target = TARG3

17

VPC Prediction Algorithm

Access the conditional branch predictor and the BTB with VPCA and VGHR

Compute VPCA and VGHR for the next iteration
VPCA = PC XOR HASHVAL[iter]
VGHR = VGHR << 1

Predicted not taken: Move to the next iteration

Predicted taken: Use the target in the BTB as the target of the indirect branch

Give up and stall if iteration count > MAX_ITER or on a BTB miss
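The steps above can be sketched as a software loop. This is a toy model rather than the hardware design: ToyFrontEnd, Prediction, and vpc_predict are hypothetical names, the predictor and BTB are plain maps, and MAX_ITER = 4, the 4-bit history width, and the HASHVAL contents are assumptions taken from the example figures.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Assumed parameters: the slides do not fix MAX_ITER, the GHR width, or the table contents.
constexpr int MAX_ITER = 4;
constexpr std::uint32_t GHR_MASK = 0xF;  // 4-bit history, as drawn in the example slides
const std::uint32_t HASHVAL[MAX_ITER] = {0x0000, 0xabcd, 0x018a, 0x7a9c};

// Toy stand-ins for the conditional branch predictor and the BTB.
struct ToyFrontEnd {
    std::map<std::uint32_t, std::uint32_t> btb;                         // VPCA -> target
    std::map<std::pair<std::uint32_t, std::uint32_t>, bool> direction;  // (VPCA, VGHR) -> taken?

    bool predict_taken(std::uint32_t vpca, std::uint32_t vghr) const {
        auto it = direction.find(std::make_pair(vpca, vghr));
        return it != direction.end() && it->second;  // untrained entries predict not taken
    }
};

struct Prediction {
    bool valid;            // false means "give up and stall"
    std::uint32_t target;  // meaningful only when valid is true
    int iter;              // iteration at which the prediction ended
};

Prediction vpc_predict(const ToyFrontEnd& fe, std::uint32_t pc, std::uint32_t ghr) {
    std::uint32_t vpca = pc;  // the first virtual branch uses the real PC
    std::uint32_t vghr = ghr & GHR_MASK;
    for (int iter = 1; iter <= MAX_ITER; ++iter) {
        auto entry = fe.btb.find(vpca);
        if (entry == fe.btb.end())
            return Prediction{false, 0, iter};             // BTB miss: stall
        if (fe.predict_taken(vpca, vghr))
            return Prediction{true, entry->second, iter};  // predicted taken: use BTB target
        if (iter < MAX_ITER) {
            vpca = pc ^ HASHVAL[iter];                     // VPCA for the next iteration
            vghr = (vghr << 1) & GHR_MASK;                 // shift in a not-taken outcome
        }
    }
    return Prediction{false, 0, MAX_ITER};                 // iteration limit reached: stall
}
```

With BTB entries at L, VL2, and VL3 and a taken entry only for (VL3, 1100), a call starting from GHR = 1111 walks the same three iterations as the Process-I to Process-III slides and returns the third target.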

18

VPC Training Algorithm

An iterative process when an indirect branch is retired (not on the critical path)

Update the conditional branch predictor
Virtual branch has a correct target: Taken
Virtual branch has a wrong target: Not-taken

Update replacement policy bits of the correct target in the BTB

Insert the correct target into the BTB
Conditional branch predictor: taken
Replace the least frequently used target (LFU)
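A toy version of the training walk is sketched below, under several assumptions that the slides do not state: the replacement bits are modeled as a simple per-entry use counter, an empty BTB slot is preferred over evicting an existing target, ties go to the earliest iteration, and a newly inserted entry starts with a counter of zero. BtbEntry, ToyFrontEnd, and vpc_train are hypothetical names, and the table parameters repeat the assumed values from the example figures.

```cpp
#include <cstdint>
#include <map>
#include <utility>

constexpr int MAX_ITER = 4;              // assumed iteration limit
constexpr std::uint32_t GHR_MASK = 0xF;  // assumed 4-bit history
const std::uint32_t HASHVAL[MAX_ITER] = {0x0000, 0xabcd, 0x018a, 0x7a9c};  // assumed contents

struct BtbEntry {
    std::uint32_t target;
    unsigned uses;  // stand-in for the replacement policy bits (LFU counter)
};

struct ToyFrontEnd {
    std::map<std::uint32_t, BtbEntry> btb;                              // VPCA -> entry
    std::map<std::pair<std::uint32_t, std::uint32_t>, bool> direction;  // (VPCA, VGHR) -> taken?
};

// Called when an indirect branch retires and its correct target is known.
void vpc_train(ToyFrontEnd& fe, std::uint32_t pc, std::uint32_t ghr, std::uint32_t correct) {
    std::uint32_t vpca = pc;
    std::uint32_t vghr = ghr & GHR_MASK;
    bool have_victim = false;
    long victim_score = 0;  // -1 for an empty slot, otherwise the use counter
    std::uint32_t victim_vpca = 0, victim_vghr = 0;

    for (int iter = 1; iter <= MAX_ITER; ++iter) {
        auto entry = fe.btb.find(vpca);
        if (entry != fe.btb.end() && entry->second.target == correct) {
            fe.direction[std::make_pair(vpca, vghr)] = true;  // correct target: train taken
            ++entry->second.uses;                             // update replacement bits
            return;
        }
        long score = -1;
        if (entry != fe.btb.end()) {
            fe.direction[std::make_pair(vpca, vghr)] = false;  // wrong target: train not-taken
            score = static_cast<long>(entry->second.uses);
        }
        if (!have_victim || score < victim_score) {  // remember the best slot to replace
            have_victim = true;
            victim_score = score;
            victim_vpca = vpca;
            victim_vghr = vghr;
        }
        if (iter < MAX_ITER) {
            vpca = pc ^ HASHVAL[iter];
            vghr = (vghr << 1) & GHR_MASK;
        }
    }

    // Correct target not found: insert it and train that virtual branch as taken.
    BtbEntry fresh = {correct, 0};
    fe.btb[victim_vpca] = fresh;
    fe.direction[std::make_pair(victim_vpca, victim_vghr)] = true;
}
```

Training the same branch first with one target and then with another leaves the first target at VPCA = L, now trained not-taken, and places the second target at the next virtual PC, trained taken.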

19

Hardware Cost and Complexity

[Figure: block diagram. Labeled components: iteration counter, hash function, PC, VPCA, GHR, VGHR, branch direction predictor (BP), and BTB. Labeled signals: Taken/Not Taken, Predict?, Direct/Indirect, and Target Address.]

20

Outline

Background and Motivation

VPC Prediction

Results

Conclusion

21

Simulation Methodology

Pin-based x86 simulator

Processor configuration
4K-entry BTB
64KB perceptron conditional branch predictor
Minimum 30-cycle branch misprediction penalty
8-wide, 512-entry instruction window
Less aggressive processor (in the paper)
Gshare, O-GEHL conditional branch predictors

Indirect branch intensive benchmarks
5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++
IBM server benchmarks (OLTP) (in the paper)

22

VPC MPKI

[Figure: indirect branch mispredictions (MPKI), y-axis 0 to 16, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; bars for baseline and VPC-ITER-2, 4, 6, 8, 10, 12, 14, 16.]

23

VPC Performance

[Figure: % IPC improvement over baseline, y-axis 0 to 110, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; bars for VPC-ITER-2, 4, 6, 8, 10, 12, 14, 16.]

24

Different Direction Predictors

[Figure: IPC improvement (%), y-axis 0 to 35, with VPC prediction on top of gshare, perceptron, and O-GEHL direction predictors; the conditional branch accuracy (%) labels are 98%, 98.3%, and 99%, in that order.]

Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!

25

VPC vs. Static Devirtualization

Advantages of static devirtualization
Enables other compiler optimizations (function inlining)
Can reduce the number of mispredictions

Disadvantages/Limitations of static devirtualization
Not all indirect branches can be statically devirtualized
Requires extensive static analysis/profiling
Lack of adaptivity to run-time input set and phase behavior

VPC prediction can be used with statically devirtualized binaries
10% improvement on top of static devirtualization

26

Outline

Background and Motivation

VPC Prediction

Results

Conclusion

27

Conclusion

VPC dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware

VPC prediction reduces the branch misprediction penalty without significant extra hardware storage.

Baseline: 26% IPC improvement

O-GEHL: 31% IPC improvement

VPC can be an enabler encouraging programmers to use object-oriented programming styles

Thank you!

Questions?

29

VPC vs. Cascaded IBP

[Figure: % IPC improvement over baseline, y-axis -20 to 120, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; bars for cascaded predictors of 704B, 1.4KB, 2.8KB, 5.5KB, 11KB, 22KB, 44KB, 88KB, and 176KB, and for VPC-ITER-12.]

30

VPC vs. Other Indirect BP

Predictor                  gcc      crafty  eon      perlbmk
Tagged Target Cache (TTC)  12KB     1.5KB   >192KB   1.5KB
Cascaded                   >176KB   2.8KB   >176KB   2.8KB

TTC: Chang et al. (’96)

Cascaded: Driesen and Holzle (’98)

31

Iterative prediction

It doesn’t hurt performance significantly

Why? Most predictions are made within a few iterations.

32

VPC Hit Iteration Counter

[Figure: stacked bars, 0% to 100%, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; legend: iteration 1, 2, 3, 4, 5-6, 7-8, 9-10, 11-12.]

33

Can the BTB be pipelined?

Yes

The next iteration of VPC can be started without knowing the outcome of the previous iteration in the pipeline.

Consecutive VPC prediction iterations can simply be pipelined.

If an iteration turns out not to be needed, its prediction is simply discarded.

34

Is a 4K-entry BTB too large?

Pentium 4 has a 4K-entry BTB

IBM Z series (z990) has an 8K-entry BTB

AMD Athlon and Hammer have 2K-entry BTBs

35

BTB Size Effects

[Figure: x-axis BTB size of 512, 1024, 2048, and 4096 entries; one axis shows indirect branch mispredictions (MPKI), 0 to 8, for base and vpc; the other axis shows % IPC improvement over baseline, 0 to 40.]

36

VPC Prediction Accuracy

[Figure: stacked bars of VPC access (%), 0% to 100%, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, and ixx; legend: correct, wrong target, no target.]

37

Target Distribution

[Figure: stacked bars, 0% to 100%, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; legend: 1, 2, 3, 4, 5, 6-10, 11-15, and 16+ targets.]

38

VPC vs. Tagged Target Cache

[Figure: % IPC improvement over baseline, y-axis 0 to 120, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; bars for TTC sizes of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, 48KB, 96KB, and 192KB, and for VPC-ITER-12.]

39

VPC Prediction Delay Effects

[Figure: % IPC improvement over baseline, y-axis 0 to 120, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; bars for 1, 2, 4, 6, 8, and 10 br/cycle.]

40

VPC with O-GEHL BP

[Figure: % IPC improvement over baseline, y-axis 0 to 120, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; bars for TTC sizes of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and for VPC-ITER-12.]

41

VPC with a Less Aggressive Processor

[Figure: % IPC improvement over baseline, y-axis 0 to 70, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, and AVG; bars for TTC sizes of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and for VPC-ITER-12.]

42

Server Benchmarks

[Figure: indirect branch mispredictions (MPKI), y-axis 0 to 16, for OLTP1, OLTP2, OLTP3, and AVG; bars for baseline and VPC-ITER-2, 4, 6, 8, 10, 12, 14, 16.]

43

Server Benchmarks (VPC vs. TTC)

[Figure: indirect branch mispredictions (MPKI), y-axis 0 to 18, for OLTP1, OLTP2, OLTP3, and AVG; bars for baseline, TTC sizes of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and VPC-ITER-10.]

44

VPC Prediction vs. Compiler-Based Devirtualization (With TTC)

[Figure: % IPC improvement over baseline, y-axis -10 to 90, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, and AVG; bars for TTC sizes of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and for VPC-ITER-12.]

45

Conditional Br. Prediction Effects

[Figure: conditional branch MPKI, y-axis 0 to 4, for gshare, perceptron, and O-GEHL; bars for Base and VPC.]

VPC prediction reduces the accuracy of conditional branch direction prediction, but not by much!

46

Indirect Branch Mispredictions

[Figure: percentage of all mispredicted branches (%) that are indirect branches, y-axis 0 to 60.]

47

VPC Prediction with Static Devirtualization

VPC prediction can be used with statically devirtualized binaries.

Not all indirect branches could be devirtualized

[Figure: % IPC improvement over baseline, y-axis 0 to 60, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, and AVG; bars for VPC-ITER-4, 6, 8, 10, 12.]

48

VPC Training: Correct Prediction

Retirement: call R1 // PC: L (real instruction)

Known: Correctly predicted, predicted iter = 3

Iter  VPCA  VGHR    Direction BP  BTB
1     L     GHR     Not-taken     -
2     VL2   GHR<<1  Not-taken     -
3     VL3   GHR<<2  Taken         Update replacement

49

VPC Training: Misprediction

Retirement: call R1 // PC: L (real instruction)

Known: Mispredicted, correct target address

Iter  VPCA  VGHR    BTB access       Train direction BP  Train BTB
1     L     GHR     TARG != Correct  Not-taken           -
2     VL2   GHR<<1  TARG != Correct  Not-taken           -
3     VL3   GHR<<2  TARG = Correct   Taken               Update replacement

50

VPC Training: Misprediction

Retirement: call R1 // PC: L (real instruction)

Known: Mispredicted, correct target address

Iter  VPCA  VGHR    BTB access       Train direction BP  Train BTB
1     L     GHR     TARG != Correct  Not-taken           -
2     VL2   GHR<<1  TARG != Correct  Not-taken           -
3     VL3   GHR<<2  TARG != Correct  Not-taken           -

No Target

51

VPC Training: Misprediction

Retirement: call R1 // PC: L (real instruction)

Known: Mispredicted, correct target address

Iter  VPCA  VGHR    BTB access       Repl. counter  Train BP   Train BTB
1     L     GHR     TARG != Correct  3              Not-taken  -
2     VL2   GHR<<1  TARG != Correct  1              Not-taken  Nothing
3     VL3   GHR<<2  TARG != Correct  8              Not-taken  -

Replacement: Repl. counter 0, Train BP: Taken, Train BTB: Insert

52

Does VPC need an extra BTB port?

No

A read from the BTB is only needed when a branch is mispredicted.

95% of branches are correctly predicted with VPC.

The read is performed only when there is an available BTB port.