EE 382N Guest Lecture Wish Branches Hyesoon Kim HPS Research Group The University of Texas at Austin

Page 1: EE 382N Guest Lecture Wish Branches

EE 382N Guest Lecture: Wish Branches

Hyesoon Kim

HPS Research Group The University of Texas at Austin

Page 2: EE 382N Guest Lecture Wish Branches

EE382N Guest Lecture 4.10.2006

Lecture Outline

Predicated execution
Wish branches
2D-profiling

Page 3: EE 382N Guest Lecture Wish Branches

Motivation

Branch predictors are still not perfect.

Deeper pipelines and larger instruction windows increase the branch misprediction penalty.

Predicated execution can eliminate branch mispredictions by converting control dependencies to data dependencies.

However, predicated code has overhead.

Page 4: EE 382N Guest Lecture Wish Branches

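Predicated Execution

Convert control flow dependency to data dependency

Pro: Eliminate hard-to-predict branches

Cons: (1) Fetch blocks B and C all the time (2) Wait until p1 is resolved

Source code: if (cond) { b = 0; } else { b = 1; }

(normal branch code)

    A   p1 = (cond)
        branch p1, TARGET
    B   mov b, 1
        jmp JOIN
    C   TARGET: mov b, 0
    D   JOIN: add x, b, 1

(predicated code)

    A   p1 = (cond)
    B   (!p1) mov b, 1
    C   (p1) mov b, 0
    D   add x, b, 1

[Control-flow diagrams: in the normal branch code, A goes to C when the branch is taken (T) and to B when it is not taken (N), and both B and C continue to D. In the predicated code, A, B, C, and D execute in sequence.]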
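As a rough illustration of the if-conversion on this slide, the Python sketch below models both versions of the example and checks that they compute the same value of x. The function names and the `predicated_mov` helper are hypothetical, introduced only for this sketch; real predication is done by the ISA and the hardware, not by a helper function.

```python
def branch_version(cond: bool) -> int:
    """Normal branch code: only one of blocks B and C executes."""
    if cond:            # branch p1, TARGET
        b = 0           # block C (TARGET)
    else:
        b = 1           # block B
    return b + 1        # block D: add x, b, 1


def predicated_mov(pred: bool, old: int, new: int) -> int:
    """Model of a predicated mov: write `new` only if the predicate is true."""
    return new if pred else old


def predicated_version(cond: bool) -> int:
    """Predicated code: blocks B and C are both fetched and executed;
    the predicate decides which mov actually updates b."""
    p1 = cond                       # block A: p1 = (cond)
    b = -1                          # arbitrary old value of b
    b = predicated_mov(not p1, b, 1)    # block B: (!p1) mov b, 1
    b = predicated_mov(p1, b, 0)        # block C: (p1) mov b, 0
    return b + 1                    # block D depends on b, hence on p1


if __name__ == "__main__":
    for cond in (True, False):
        print(cond, branch_version(cond), predicated_version(cond))
```

Both movs appear on the predicated path regardless of `cond`, which mirrors the two costs listed on the slide: blocks B and C are always fetched, and block D cannot proceed until p1 is known.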

Page 5: EE 382N Guest Lecture Wish Branches

The Overhead of Predicated Execution

(Predicated code)

    A   p1 = (cond)
    B   (!p1) mov b, 1
    C   (p1) mov b, 0
    D   add x, b, 1

With the predicate dependency removed, the example is shown as (0) mov b,1 and (1) mov b,0.

[Figure: execution time normalized to non-predicated code for gzip, vpr, mcf, crafty, parser, gap, vortex, bzip2, twolf, and AVG. Bars: PREDICATED CODE, NO-DEPENDENCY, NO-DEPENDENCY + NO-FETCH. Y-axis from 0 to 1.2; one bar is off the scale at 2.02. Annotated values: -2%, 13%, 16%.]

If all overhead is ideally eliminated, predicated execution would provide 16% improvement in average execution time

Page 6: EE 382N Guest Lecture Wish Branches

The Problem

Due to the predication overhead, predicated execution sometimes reduces performance

Branch misprediction characteristics are dependent on run-time behavior: input set, control-flow path, and phase behavior.

The compiler cannot accurately estimate the run-time behavior of branches

Page 7: EE 382N Guest Lecture Wish Branches

Predicated Code Performance vs. Branch Misprediction Rate

Execution time of normal branch code = exec_T * P(T) + exec_N * P(N) + misp_penalty * P(misprediction)

Execution time of predicated code = exec_pred

[Figure: execution time (cycles) vs. branch misprediction rate (%, 0 to 15) for predicated code and normal branch code. Below the crossover point, normal branch code performs better; above it, predicated code performs better. Two points are marked: run-time (input B) and profile-time (input A).]

Converting a branch to predicated code could hurt performance if run-time misprediction rate is lower than profile-time misprediction rate
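The two execution-time formulas on this slide can be turned into a small cost model. The sketch below is only an illustration: the function names are hypothetical, and the example cycle counts are made up (the 30-cycle penalty echoes the minimum misprediction penalty quoted in the simulation methodology slide).

```python
def normal_branch_time(exec_t: float, exec_n: float, p_taken: float,
                       misp_penalty: float, p_misp: float) -> float:
    """exec_T * P(T) + exec_N * P(N) + misp_penalty * P(misprediction)."""
    return exec_t * p_taken + exec_n * (1.0 - p_taken) + misp_penalty * p_misp


def break_even_misp_rate(exec_t: float, exec_n: float, p_taken: float,
                         misp_penalty: float, exec_pred: float) -> float:
    """Misprediction rate at which normal branch code and predicated code
    take the same time. Above this rate predicated code wins."""
    no_misp_time = exec_t * p_taken + exec_n * (1.0 - p_taken)
    return (exec_pred - no_misp_time) / misp_penalty


if __name__ == "__main__":
    # Illustrative numbers only: 3 cycles per path, 5 cycles predicated.
    rate = break_even_misp_rate(3, 3, 0.5, 30, 5)
    print(f"break-even misprediction rate: {rate:.1%}")
    for p in (0.05, 0.10):
        t = normal_branch_time(3, 3, 0.5, 30, p)
        print(f"misprediction rate {p:.0%}: normal branch code takes {t} cycles")
```

With these numbers, a branch profiled at a 10% misprediction rate looks like a good candidate for predication, but if the run-time rate is 5% the predicated version is slower, which is the failure case the slide describes.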

Page 8: EE 382N Guest Lecture Wish Branches

Lecture Outline

Predicated execution
Wish branches
2D-profiling

Page 9: EE 382N Guest Lecture Wish Branches

Wish Branches [Kim et al. Micro-38]

A new type of control flow instruction
  3 types: wish jump, wish join, and wish loop

The compiler generates code (with wish branches) that can be executed either as predicated code or non-predicated code (normal branch code)

The hardware decides to execute predicated code or normal branch code at run-time based on the confidence of branch prediction
  Easy to predict: normal branch code
  Hard to predict: predicated code

Page 10: EE 382N Guest Lecture Wish Branches

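Wish Jump/Join

normal branch code

    A   p1 = (cond)
        branch p1, TARGET
    B   mov b, 1
        jmp JOIN
    C   TARGET: mov b,0
    D   JOIN:

predicated code

    A   p1 = (cond)
    B   (!p1) mov b,1
    C   (p1) mov b,0
    D

wish jump/join code

    A   p1 = (cond)
        wish.jump p1 TARGET        (wish jump)
    B   (!p1) mov b,1
        wish.join !p1 JOIN         (wish join)
    C   TARGET: (p1) mov b,0
    D   JOIN:

High Confidence: the wish jump is predicted Taken or Not-Taken like a normal branch, and the predicates are replaced with (1):

    B   (1) mov b,1
        wish.join (1) JOIN
    C   TARGET: (1) mov b,0

Low Confidence: the wish jump and the wish join act as nops, and blocks B and C execute as predicated code.

[Control-flow diagrams: in the normal branch code, A goes to C when taken (T) and to B when not taken (N), and both continue to D. In the predicated code, A, B, C, and D execute in sequence.]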
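One way to read the wish jump/join diagram is as a front-end policy that picks which blocks to fetch. The following Python sketch captures that reading; the function names are hypothetical, and the flush rule follows the hardware support slide later in the lecture ("No flush if wish branch is mispredicted during low-confidence mode").

```python
from typing import List


def blocks_fetched(high_confidence: bool, predicted_taken: bool) -> List[str]:
    """Blocks the front end fetches for the wish jump/join example."""
    if high_confidence:
        # Behaves like normal branch code: follow the predicted direction.
        return ["A", "C", "D"] if predicted_taken else ["A", "B", "D"]
    # Low confidence: wish jump and wish join act as nops, so both
    # B and C are fetched and executed as predicated code.
    return ["A", "B", "C", "D"]


def needs_flush(high_confidence: bool, predicted_taken: bool,
                actually_taken: bool) -> bool:
    """A wrong direction only costs a pipeline flush in high-confidence mode."""
    return high_confidence and predicted_taken != actually_taken


if __name__ == "__main__":
    print(blocks_fetched(True, True))     # high confidence, predicted taken
    print(blocks_fetched(False, True))    # low confidence
    print(needs_flush(True, True, False))
    print(needs_flush(False, True, False))
```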

Page 11: EE 382N Guest Lecture Wish Branches

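Wish Loop

Source code:

    do {
      a++;
      i++;
    } while (i<N);

normal backward branch code

    X   LOOP: add a, a, 1
              add i, i, 1
              p1 = (i<N)
              branch p1, LOOP
    Y   EXIT:

wish loop code

    H         mov p1, 1
    X   LOOP: (p1) add a, a, 1
              (p1) add i, i, 1
              (p1) p1 = (cond)
              wish.loop p1, LOOP
    Y   EXIT:

High Confidence: the three (p1) predicates in block X are replaced with (1).

Low Confidence: block X executes as predicated code.

[Control-flow diagrams: X branches back to itself when taken (T) and exits to Y when not taken (N). In the wish loop code, block H precedes X.]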
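To see why the predicated loop body tolerates extra iterations, the sketch below models the wish loop code in Python. It is a simplified model under stated assumptions: `fetched_iterations` is a hypothetical parameter standing for how many copies of block X the front end fetches, and `(cond)` is taken to be `i < N` as in the source code on the slide.

```python
from typing import Tuple


def do_while_loop(a: int, i: int, n: int) -> Tuple[int, int]:
    """Reference semantics: do { a++; i++; } while (i < n);"""
    while True:
        a += 1
        i += 1
        if not (i < n):
            break
    return a, i


def wish_loop_predicated(a: int, i: int, n: int,
                         fetched_iterations: int) -> Tuple[int, int, bool]:
    """Wish loop body executed as predicated code for a fixed number of
    fetched iterations. Once p1 becomes false, later iterations are nops."""
    p1 = True                       # H: mov p1, 1
    for _ in range(fetched_iterations):
        if p1:                      # (p1) add a, a, 1
            a += 1
        if p1:                      # (p1) add i, i, 1
            i += 1
        if p1:                      # (p1) p1 = (i < N)
            p1 = i < n
    return a, i, p1


if __name__ == "__main__":
    print(do_while_loop(0, 0, 3))
    # Two extra iterations are fetched, but they change nothing.
    print(wish_loop_predicated(0, 0, 3, fetched_iterations=5))
```

If too few iterations are fetched, p1 is still true at the end, which corresponds to the case where the loop was exited too early and execution must be redirected back into the loop.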

Page 12: EE 382N Guest Lecture Wish Branches

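Mispredicted Case 1: Early-Exit

Correct execution: X1 -T-> X2 -T-> X3 -N-> Y

Early-exit (low confidence): X1 -T-> X2 -N-> Y, flush pipeline, then X3 -N-> Y

Compared to normal branch code: predicate data dependency and one extra instruction (-)

[Loop diagram: H precedes X; X branches back to itself when taken (T) and exits to Y when not taken (N).]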

Page 13: EE 382N Guest Lecture Wish Branches

Mispredicted Case 2: Late-Exit

Correct execution: X1 -T-> X2 -T-> X3 -N-> Y

Late-exit (low confidence): X1 -T-> X2 -T-> X3 -T-> X4 (nop) -T-> X5 (nop) -N-> Y …

Compared to normal branch code:
  pro: reduce flush penalty (+++)
  cons: predicate data dependency and one extra instruction (-)

[Loop diagram: H precedes X; X branches back to itself when taken (T) and exits to Y when not taken (N).]

Page 14: EE 382N Guest Lecture Wish Branches

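Mispredicted Case 3: No-Exit

Correct execution: X1 -T-> X2 -T-> X3 -N-> Y

Late-exit: X1 -T-> X2 -T-> X3 -T-> X4 (nop) -T-> X5 (nop) -N-> Y …

No-exit: X1 -T-> X2 -T-> X3 -T-> X4 -T-> X5 -T-> X6 …, flush pipeline, then Y

No-Exit: predicate data dependency and one extra instruction (-)

[Loop diagram: H precedes X; X branches back to itself when taken (T) and exits to Y when not taken (N).]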
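The three mispredicted wish loop cases can be summarized as a small classifier. This Python sketch is an interpretation of the diagrams, not the hardware algorithm: the function name is hypothetical, iteration counts stand in for the per-iteration T/N predictions, and `predicted_iters=None` is an assumed encoding for "the front end had not predicted a loop exit by the time the misprediction was detected". It applies to low-confidence mode only.

```python
from typing import Optional, Tuple


def classify_wish_loop(actual_iters: int,
                       predicted_iters: Optional[int]) -> Tuple[str, bool, int]:
    """Return (case, needs_flush, nop_iterations) for a wish loop executed
    in low-confidence mode."""
    if predicted_iters is None:
        # Front end is still fetching loop iterations: flush and go to Y.
        return ("no-exit", True, 0)
    if predicted_iters == actual_iters:
        return ("correct", False, 0)
    if predicted_iters < actual_iters:
        # Left the loop too soon: flush and re-enter the loop.
        return ("early-exit", True, 0)
    # Left the loop too late: the extra iterations execute as nops.
    return ("late-exit", False, predicted_iters - actual_iters)


if __name__ == "__main__":
    for predicted in (2, 3, 5, None):
        print(predicted, classify_wish_loop(3, predicted))
```

Only the late-exit case avoids the flush, which is why it is the one case on these slides marked with a benefit (+++) relative to normal branch code.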

Page 15: EE 382N Guest Lecture Wish Branches

Questions?

What kind of branches should be converted to wish branches (jump/join)?

Why not all branches?

What kind of branches should be converted to wish loops?

Page 16: EE 382N Guest Lecture Wish Branches

Advantages/Disadvantages of Wish Branches

Advantages compared to predicated execution

  Reduce the overhead of predication

  Increase the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code

  Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (Wish loops)

  Make predicated code less dependent on machine configuration (e.g. branch predictor)

Page 17: EE 382N Guest Lecture Wish Branches

Advantages/Disadvantages of Wish Branches

Disadvantages compared to predicated execution

Extra branch instructions use machine resources

Extra branch instructions increase the contention for branch predictor table entries

May constrain the compiler’s scope for code optimizations

Page 18: EE 382N Guest Lecture Wish Branches

Wish Branch Support

ISA Support
  predicated execution, wish branch instruction

Compiler Support
  Wish branch generation algorithms
  The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches

Hardware Support
  Instruction decode logic
  Predicate dependency elimination module
  Confidence estimator
  Front-end and branch misprediction detection/recovery module

Page 19: EE 382N Guest Lecture Wish Branches

ISA Support

Using existing hint bits (IA-64, x86, PowerPC)
  Hint bits can be ignored. A wish branch can be treated as a normal branch.

Instruction format: OPCODE | btype | wtype | target offset | p

  btype: branch type (0: normal branch, 1: wish branch)

  wtype: wish branch type (0: jump, 1: loop, 2: join)

  p: predicate register identifier
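The field list on this slide can be illustrated with a small encode/decode sketch in Python. The slide does not give field widths or bit positions, so the layout here is purely hypothetical: 1 bit for btype, 2 bits for wtype, and 6 bits for p (enough for the 64 predicate registers of IA-64). The opcode and target offset fields are left out.

```python
from typing import NamedTuple

WTYPE_NAMES = {0: "jump", 1: "loop", 2: "join"}


class WishHint(NamedTuple):
    btype: int   # 0: normal branch, 1: wish branch
    wtype: int   # 0: jump, 1: loop, 2: join
    p: int       # predicate register identifier


def encode(h: WishHint) -> int:
    """Pack the fields into 9 bits (hypothetical layout: btype|wtype|p)."""
    if h.btype not in (0, 1) or h.wtype not in WTYPE_NAMES or not 0 <= h.p < 64:
        raise ValueError(f"field out of range: {h}")
    return (h.btype << 8) | (h.wtype << 6) | h.p


def decode(bits: int) -> WishHint:
    """Inverse of encode()."""
    return WishHint((bits >> 8) & 0x1, (bits >> 6) & 0x3, bits & 0x3F)


def describe(h: WishHint) -> str:
    """Hardware that ignores the hint bits sees only a normal branch."""
    if h.btype == 0:
        return "normal branch"
    return f"wish {WTYPE_NAMES[h.wtype]} on p{h.p}"


if __name__ == "__main__":
    hint = WishHint(btype=1, wtype=2, p=5)
    bits = encode(hint)
    print(bin(bits), decode(bits), describe(decode(bits)))
```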

Page 20: EE 382N Guest Lecture Wish Branches

Wish Branch Support

ISA Support
  predicated execution, wish branch instruction

Compiler Support
  Wish branch generation algorithms
  The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches

Hardware Support
  Instruction decode logic
  Predicate dependency elimination module
  Confidence estimator
  Front-end and branch misprediction detection/recovery module

Page 21: EE 382N Guest Lecture Wish Branches

Compiler Support

Major phase ordering with wish branch generation in code generation [ORC]

[Diagram of compiler phases; boxes are marked as existing, modified, or new.]

Code generation phases:
  region formation
  if-conversion
  loop opt (swp, unrolling)
  global inst. sched
  register allocation
  local inst. sched

Profiling input: edge/value profiling

If-conversion steps:
  select candidates
  cost-benefit analysis
  predicate selected blocks
  branch elimination

Wish branch steps:
  wish jump conversion, if-conversion, wish join insertion
  wish loop conversion, loop opt

Page 22: EE 382N Guest Lecture Wish Branches

Wish Branch Generation Algorithm

Wish jump/join candidates: all branches that are suitable for if-conversion
  If the number of instructions in the fall-through block > N (N=5): wish jump and join are inserted
  All other branches are converted to predicated code

A loop branch is converted into a wish loop when the loop body has fewer than L instructions (L=30)
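The selection rules on this slide are simple thresholds, so they can be written down directly. The Python sketch below uses hypothetical function names; the thresholds N=5 and L=30 come from the slide, while the fallback to a normal branch for non-candidates is an assumption based on the compiler support slide ("which stay as normal branches").

```python
N_FALLTHROUGH = 5    # N from the slide
L_LOOP_BODY = 30     # L from the slide


def convert_forward_branch(suitable_for_if_conversion: bool,
                           fallthrough_insts: int) -> str:
    """Decide how a forward branch is emitted."""
    if not suitable_for_if_conversion:
        return "normal branch"          # assumption: not a candidate
    if fallthrough_insts > N_FALLTHROUGH:
        return "wish jump/join"
    return "predicated code"


def convert_loop_branch(loop_body_insts: int) -> str:
    """Decide how a backward (loop) branch is emitted."""
    if loop_body_insts < L_LOOP_BODY:
        return "wish loop"
    return "normal branch"              # assumption: left unchanged


if __name__ == "__main__":
    print(convert_forward_branch(True, 8))
    print(convert_forward_branch(True, 3))
    print(convert_forward_branch(False, 8))
    print(convert_loop_branch(12))
    print(convert_loop_branch(40))
```

Short fall-through blocks are cheap to execute as predicated code every time, so only the longer ones get the wish jump and wish join that let the hardware skip them when the branch is easy to predict.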

Page 23: EE 382N Guest Lecture Wish Branches

Wish Branch Support

ISA Support
  predicated execution, wish branch instruction

Compiler Support
  Wish branch generation algorithms
  The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches

Hardware Support
  Instruction decode logic
  Predicate dependency elimination module
  Front-end and branch misprediction detection/recovery module
  Confidence estimator

Page 24: EE 382N Guest Lecture Wish Branches

Hardware Support

Instruction fetch/decode logic
  Decoder: decode wish branches
  BTB: mark wish branches

Wish branch state machine hardware
  A wish loop stays in low-confidence mode until the loop exits

Predicate dependency elimination module
  High-confidence mode: predicate values are predicted

Branch misprediction detection/recovery module
  No flush if a wish branch is mispredicted during low-confidence mode

Confidence estimator

Page 25: EE 382N Guest Lecture Wish Branches

JRS Confidence Estimator

Assigning Confidence to Conditional Branch Predictions [Jacobsen et al. Micro-29]

Estimate how much confidence the processor has in a branch prediction

Trained with branch misprediction information

[Diagram: the PC and the global BHR (m bits) are combined (+) to index a table of 2^m entries of n-bit counters. The selected counter is compared with a threshold (> th?) to produce High Confidence or Low Confidence.]
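A minimal software model of the estimator in the diagram might look like the Python sketch below. Several details are assumptions rather than facts from the slide: the PC and global BHR are combined with XOR, the counters are resetting counters (incremented on a correct prediction, cleared on a misprediction), and the default values of m, n, and the threshold are arbitrary. The tagged, 16-bit-history organization used in the experiments is not modeled.

```python
class JRSConfidenceEstimator:
    """Table of 2^m saturating n-bit counters indexed by PC and global BHR."""

    def __init__(self, m: int = 12, n: int = 4, threshold: int = 14) -> None:
        self.mask = (1 << m) - 1
        self.max_count = (1 << n) - 1
        self.threshold = threshold
        self.table = [0] * (1 << m)

    def _index(self, pc: int, bhr: int) -> int:
        # Assumption: XOR-combine PC and branch history, keep m bits.
        return (pc ^ bhr) & self.mask

    def is_high_confidence(self, pc: int, bhr: int) -> bool:
        """High confidence when the counter exceeds the threshold (> th?)."""
        return self.table[self._index(pc, bhr)] > self.threshold

    def update(self, pc: int, bhr: int, prediction_correct: bool) -> None:
        """Train with the outcome of the branch prediction."""
        idx = self._index(pc, bhr)
        if prediction_correct:
            self.table[idx] = min(self.table[idx] + 1, self.max_count)
        else:
            self.table[idx] = 0      # assumption: reset on a misprediction


if __name__ == "__main__":
    est = JRSConfidenceEstimator(m=4, n=4, threshold=2)
    pc, bhr = 0x40, 0b1010
    for outcome in (True, True, True, False):
        est.update(pc, bhr, outcome)
        print(outcome, est.is_high_confidence(pc, bhr))
```

For a wish branch, a high-confidence answer would select normal branch behavior and a low-confidence answer would select predicated execution.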

Page 26: EE 382N Guest Lecture Wish Branches

Experimental Infrastructure

IA-64 provides full support for predication

Convert IA-64 traces to micro-ops to simulate an out-of-order superscalar processor model

[Flow diagram: Source Code -> IA-64 Compiler (ORC) -> IA-64 Binary -> Trace generation module -> IA-64 Trace -> Micro-op Translator -> µops -> Micro-op Simulator]

Page 27: EE 382N Guest Lecture Wish Branches

Simulation Methodology

Nine SPEC 2000 integer benchmarks

Baseline Processor Configuration

  Front End
    Large and accurate branch predictor (64KB hybrid branch predictor: gshare + local)
    Minimum 30-cycle branch misprediction penalty
    64KB, 2-cycle latency I-cache

  Execution Core
    8-wide out-of-order processor
    512-entry instruction window

  Confidence Estimator
    1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al. MICRO-29)

Page 28: EE 382N Guest Lecture Wish Branches

Performance Improvement

[Figure: execution time normalized to non-predicated code for gzip, vpr, mcf, crafty, parser, gap, vortex, bzip2, twolf, AVG, and AVGnomcf. Bars: SELECTIVE-PREDICATION, AGGRESSIVE-PREDICATION, wish jump/join, wish jump/join/loop. Y-axis from 0 to 1.2; one bar is off the scale at 2.02. Annotated values: 24%, 8%, 14%, -4%.]

SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis

AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated

Callouts shown on the figure:

  14% over conditional branch prediction, 13% over selective-predication, and 16% over aggressive-predication

  16% over conditional branch prediction (w/o mcf), 11% over selective-predication (w/o mcf), 7% over aggressive predication (w/o mcf)

  12% over conditional branch prediction, 11% over selective-predication, 13% over aggressive predication

Page 29: EE 382N Guest Lecture Wish Branches

Wish Branch: Conclusion

New control flow instructions: wish branches (jump/join/loop)

Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture
  Compiler: analyzes the control-flow graph and generates code
  Microarchitecture: makes run-time decision to use predication

Wish branches provide significant performance benefits
  16% compared to conditional branch prediction
  13% compared to selectively predicated code

Wish branches can make predicated execution more viable and effective in high performance processors
  By enabling adaptive and aggressive predicated execution

Page 30: EE 382N Guest Lecture Wish Branches

Lecture Outline

Predicated execution
Wish branches
2D-profiling

Page 31: EE 382N Guest Lecture Wish Branches

2D-profiling

Goal: Identify input-dependent branches by using a single input set for profiling

If We Know a Branch is Input-Dependent

  May not convert it to predicated code.

  May convert it to a wish branch.

  May not perform other compiler optimizations or may perform them less aggressively.
    Hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99]

Page 32: EE 382N Guest Lecture Wish Branches

Key Insight of 2D-profiling

Phase behavior in prediction accuracy is a good indicator of input dependence

[Figure: two plots of branch prediction accuracy (0 to 1) vs. time (in terms of number of executed instructions x 100M, 0 to 1500), one for an input-dependent branch and one for an input-independent branch, with phase 1, phase 2, and phase 3 marked.]

Page 33: EE 382N Guest Lecture Wish Branches

Traditional Profiling

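[Figure: prediction accuracy (pr. Acc) over time for two branches, brA and brB, each with its MEAN pr.Acc line.]

Traditional profiling records only MEAN pr.Acc(brA) and MEAN pr.Acc(brB), so with similar means the behavior of brA and the behavior of brB look the same.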

Page 34: EE 382N Guest Lecture Wish Branches

2D-profiling

[Figure: prediction accuracy (pr. Acc) over time for two branches, brA and brB, each with its MEAN pr.Acc line and STD pr.Acc range.]

The profile records MEAN pr.Acc(brA) and MEAN pr.Acc(brB) together with the standard deviations:

STD pr.Acc(brA) ≠ STD pr.Acc(brB)

behavior of brA ≠ behavior of brB

A: input-dependent br, B: input-independent br

Page 35: EE 382N Guest Lecture Wish Branches

2D-profiling Mechanism

The profiler collects branch prediction accuracy information for every static branch over time

Execution is divided into slices (slice size = M instructions):

  Slice 1: mean Pr.Acc(brA,s1), mean Pr.Acc(brB,s1), ...
  Slice 2: mean Pr.Acc(brA,s2), mean Pr.Acc(brB,s2), ...
  ...
  Slice N: mean Pr.Acc(brA,sN), mean Pr.Acc(brB,sN), ...

Calculate MEAN (brA, brB, …), Standard deviation (brA, brB, …), PAM: Points Above Mean (brA, brB, …)

[Figure: per-slice prediction accuracy over time for brA and brB with the lines mean brA and mean brB; example PAM values of 50% and 0% are shown.]
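The three per-branch statistics named on this slide can be computed from the per-slice accuracies as in the Python sketch below. It is a simplified reading of the mechanism: the function name is hypothetical, every slice is weighted equally, the population standard deviation is used, and the sample accuracies are invented for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def profile_2d(slice_accuracies: Sequence[float]) -> Tuple[float, float, float]:
    """Return (MEAN, standard deviation, PAM) for one static branch.

    slice_accuracies[k] is the branch's mean prediction accuracy in slice k.
    PAM (Points Above Mean) is the fraction of slices above the overall mean.
    """
    m = mean(slice_accuracies)
    std = pstdev(slice_accuracies)
    above = sum(1 for acc in slice_accuracies if acc > m)
    return m, std, above / len(slice_accuracies)


if __name__ == "__main__":
    # Invented data: same mean accuracy, very different behavior over time.
    br_a = [0.9, 0.5, 0.9, 0.5]   # accuracy changes from phase to phase
    br_b = [0.7, 0.7, 0.7, 0.7]   # accuracy is flat
    for name, data in (("brA", br_a), ("brB", br_b)):
        m, std, pam = profile_2d(data)
        print(f"{name}: mean={m:.2f} std={std:.2f} PAM={pam:.0%}")
```

A profiler that kept only the mean could not tell these two branches apart, while the standard deviation and PAM separate the branch with phase behavior from the flat one.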

Page 36: EE 382N Guest Lecture Wish Branches

2D-profiling: Conclusion & Future Work

2D-profiling is a new profiling technique to find input-dependent characteristics by using a single input data set for profiling

2D-profiling uses time-varying information instead of just average data
  Phase behavior in prediction accuracy in a profile run indicates input dependence

Future Work:
  Better predicated code/wish branch generation algorithms
  Detecting other input-dependent program characteristics