EE 382N Guest Lecture: Wish Branches
Hyesoon Kim
HPS Research Group, The University of Texas at Austin
EE382N Guest Lecture 4.10.2006
Lecture Outline
    Predicated execution
    Wish branches
    2D-profiling
Motivation
Branch predictors are still not perfect.
Deeper pipelines and larger instruction windows increase the branch misprediction penalty.
Predicated execution can eliminate branch mispredictions by converting control dependencies to data dependencies.
However, predicated code has overhead.
Predicated Execution
Convert control flow dependency to data dependency
Pro: Eliminate hard-to-predict branches
Cons: (1) Fetch blocks B and C all the time (2) Wait until p1 is resolved

Source code:
    if (cond) { b = 0; } else { b = 1; }

(normal branch code) Block A goes to C when the branch is taken (T) and to B when it is not taken (N); both paths reach D.
    A:          p1 = (cond)
                branch p1, TARGET
    B:          mov b, 1
                jmp JOIN
    C: TARGET:  mov b, 0
    D: JOIN:    add x, b, 1

(predicated code) Blocks A, B, C, and D are fetched in sequence.
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:  add x, b, 1
The Overhead of Predicated Execution
(Predicated code)
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:  add x, b, 1
With the predicate dependency removed, blocks B and C are shown as (0) mov b,1 and (1) mov b,0.
[Figure: execution time normalized to non-predicated code for gzip, vpr, mcf, crafty, parser, gap, vortex, bzip2, twolf, and AVG. Bars: PREDICATED CODE, NO-DEPENDENCY, NO-DEPENDENCY + NO-FETCH. One off-scale bar is labeled 2.02. Chart annotations: -2%, 13%, 16%.]
If all overhead is ideally eliminated, predicated execution would provide a 16% improvement in average execution time.
The Problem
Due to the predication overhead, predicated execution sometimes reduces performance.
Branch misprediction characteristics depend on run-time behavior: input set, control-flow path, and phase behavior.
The compiler cannot accurately estimate the run-time behavior of branches.
Predicated Code Performance vs. Branch Misprediction Rate
[Figure: execution time (cycles) versus branch misprediction rate (%, 0 to 15) for predicated code and normal branch code. At low misprediction rates normal branch code performs better; at high misprediction rates predicated code performs better. Two points are marked: run-time (input B) and profile-time (input A).]
Converting a branch to predicated code could hurt performance if the run-time misprediction rate is lower than the profile-time misprediction rate.
Execution time of normal branch code = exec_T * P(T) + exec_N * P(N) + misp_penalty * P(misprediction)
Execution time of predicated code = exec_pred
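
The two formulas imply a break-even misprediction rate below which normal branch code wins. The sketch below works that out in Python. All numeric values are hypothetical and chosen only for illustration, and it assumes P(N) = 1 - P(T) for a two-way branch.

```python
def exec_time_branch(exec_t, exec_n, p_taken, misp_penalty, p_misp):
    """Expected cycles for normal branch code, following the slide's formula."""
    return exec_t * p_taken + exec_n * (1.0 - p_taken) + misp_penalty * p_misp


def break_even_misp_rate(exec_t, exec_n, p_taken, misp_penalty, exec_pred):
    """Misprediction rate at which both code versions take the same time."""
    no_misp_time = exec_t * p_taken + exec_n * (1.0 - p_taken)
    return (exec_pred - no_misp_time) / misp_penalty


# Hypothetical cycle counts, not taken from the lecture.
EXEC_T, EXEC_N, P_TAKEN = 3.0, 3.0, 0.5
MISP_PENALTY, EXEC_PRED = 30.0, 5.1

threshold = break_even_misp_rate(EXEC_T, EXEC_N, P_TAKEN, MISP_PENALTY, EXEC_PRED)
print(f"break-even misprediction rate: {threshold:.1%}")

for rate in (0.04, 0.10):
    branch = exec_time_branch(EXEC_T, EXEC_N, P_TAKEN, MISP_PENALTY, rate)
    better = "predicated" if EXEC_PRED < branch else "normal branch"
    print(f"misprediction rate {rate:.0%}: {better} code is faster")
```

If profiling sees a rate above the break-even point but the run-time input produces a rate below it, the compiler's choice of predicated code costs performance, which is the situation the slide describes.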
Wish Branches [Kim et al. MICRO-38]
A new type of control flow instruction
    3 types: wish jump, wish join, and wish loop
The compiler generates code (with wish branches) that can be executed either as predicated code or as non-predicated code (normal branch code).
The hardware decides at run-time whether to execute predicated code or normal branch code, based on the confidence of the branch prediction.
    Easy to predict: normal branch code
    Hard to predict: predicated code
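
The run-time decision can be sketched as a small function. This is an illustrative model, not the hardware design: the function name and the returned dictionary are hypothetical, and the block names B (fall-through block) and C (TARGET block) follow the if-else example used in these slides.

```python
def wish_jump_fetch_plan(confidence_high, predicted_taken):
    """Return which blocks the front end fetches for an if-else hammock.

    B is the fall-through block and C is the TARGET block.
    """
    if confidence_high:
        # Easy to predict: behave like a normal branch and follow only
        # the predicted direction. A misprediction then needs a flush.
        blocks = ["C"] if predicted_taken else ["B"]
        return {"mode": "normal branch", "fetch": blocks,
                "flush_on_mispredict": True}
    # Hard to predict: fetch both paths and let the predicates select
    # the correct result, so a wrong prediction needs no flush.
    return {"mode": "predicated", "fetch": ["B", "C"],
            "flush_on_mispredict": False}


print(wish_jump_fetch_plan(confidence_high=True, predicted_taken=True))
print(wish_jump_fetch_plan(confidence_high=False, predicted_taken=True))
```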
Wish Jump/Join
(normal branch code)
    A:          p1 = (cond)
                branch p1, TARGET
    B:          mov b, 1
                jmp JOIN
    C: TARGET:  mov b, 0
    D: JOIN:
(predicated code)
    A:  p1 = (cond)
    B:  (!p1) mov b,1
    C:  (p1) mov b,0
    D
(wish jump/join code)
    A:          p1 = (cond)
                wish.jump p1 TARGET      (wish jump)
    B:          (!p1) mov b,1
                wish.join !p1 JOIN       (wish join)
    C: TARGET:  (p1) mov b,0
    D: JOIN:
High Confidence: the wish jump is followed like a normal branch, and the predicates are shown as (1).
    Taken: fetch continues at TARGET: (1) mov b,0
    Not-Taken: fetch continues with (1) mov b,1 and wish.join (1) JOIN
Low Confidence: the code is executed as predicated code.
The slide also marks two instructions as nop.
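Wish Loop
Source code:
    do {
        a++;
        i++;
    } while (i<N);
(normal backward branch code) Block X loops back to itself when taken (T) and exits to Y when not taken (N).
    X: LOOP:  add a, a, 1
              add i, i, 1
              p1 = (i<N)
              branch p1, LOOP
    Y: EXIT:
(wish loop code) A header block H is added before X.
    H:        mov p1, 1
    X: LOOP:  (p1) add a, a, 1
              (p1) add i, i, 1
              (p1) p1 = (cond)
              wish.loop p1, LOOP
    Y: EXIT:
High Confidence: the three predicates in the loop body are shown as (1).
Low Confidence: the loop body is executed as predicated code.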
Mispredicted Case 1: Early-Exit
(T = loop branch taken, N = loop branch not taken)
Correct execution:            X1 (T) X2 (T) X3 (N) Y
Early-exit (low confidence):  X1 (T) X2 (N) Y, flush pipeline, then X3 (N) Y
Compared to normal branch code: predicate data dependency and one extra instruction (-)
Mispredicted Case 2: Late-Exit
Correct execution:           X1 (T) X2 (T) X3 (N) Y
Late-exit (low confidence):  X1 (T) X2 (T) X3 (T) X4 (T) X5 (N) Y ...
    The extra iterations X4 and X5 become nops.
Compared to normal branch code:
    pro: reduce flush penalty (+++)
    cons: predicate data dependency and one extra instruction (-)
Mispredicted Case 3: No-Exit
Correct execution:  X1 (T) X2 (T) X3 (N) Y
Late-exit:          X1 (T) X2 (T) X3 (T) X4 (T) X5 (N) Y ... (X4 and X5 become nops)
No-exit:            X1 (T) X2 (T) X3 (T) X4 (T) X5 (T) X6 ..., flush pipeline, then Y
No-exit compared to normal branch code: predicate data dependency and one extra instruction (-)
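
The three mispredicted wish loop cases can be told apart by comparing how many iterations the front end fetched with how many the loop really executes. The sketch below is a simplified model, not the hardware logic: `resolve_limit` is a hypothetical stand-in for how many extra iterations the front end can fetch before the mispredicted loop branch is resolved.

```python
def classify_wish_loop(actual_iters, predicted_iters, resolve_limit):
    """Classify a wish loop prediction made in low-confidence mode.

    actual_iters:    iterations the loop really executes
    predicted_iters: iterations fetched before the front end predicts exit
    resolve_limit:   hypothetical bound on extra iterations fetched before
                     the mispredicted loop branch resolves
    """
    if predicted_iters == actual_iters:
        return "correct"
    if predicted_iters < actual_iters:
        return "early-exit"   # left the loop too soon: flush pipeline
    if predicted_iters - actual_iters <= resolve_limit:
        return "late-exit"    # extra iterations turn into nops, no flush
    return "no-exit"          # still looping at resolve time: flush pipeline


# The slides' example: the correct execution runs 3 iterations (X1 X2 X3).
for predicted in (3, 2, 5, 7):
    print(predicted, classify_wish_loop(3, predicted, resolve_limit=2))
```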
Questions?
What kind of branches should be converted to wish branches (jump/join)?
Why not all branches?
What kind of branches should be converted to wish loops?
Advantages/Disadvantages of Wish Branches
Advantages compared to predicated execution
    Reduce the overhead of predication
    Increase the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code
    Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (wish loops)
    Make predicated code less dependent on machine configuration (e.g. branch predictor)
Advantages/Disadvantages of Wish Branches
Disadvantages compared to predicated execution
Extra branch instructions use machine resources
Extra branch instructions increase the contention for branch predictor table entries
May constrain the compiler’s scope for code optimizations
Wish Branch Support
ISA Support: predicated execution, wish branch instruction
Compiler Support: wish branch generation algorithms
    The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches
Hardware Support
    Instruction decode logic
    Predicate dependency elimination module
    Confidence estimator
    Front-end and branch misprediction detection/recovery module
ISA Support
Using existing hint bits (IA-64, x86, PowerPC)
    Hint bits can be ignored. A wish branch can be treated as a normal branch.
Instruction format: OPCODE | btype | wtype | target offset | p
    btype: branch type (0: normal branch, 1: wish branch)
    wtype: wish branch type (0: jump, 1: loop, 2: join)
    p: predicate register identifier
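
A decoder for these fields might look like the sketch below. The slide does not give field widths or bit positions, so the layout here (1 bit for btype, 2 bits for wtype, 6 bits for p) is purely an assumption, as are the function and class names. The `supports_wish` flag illustrates the point that a processor may ignore the hint bits and treat a wish branch as a normal branch.

```python
from dataclasses import dataclass

WTYPE_NAMES = {0: "jump", 1: "loop", 2: "join"}


@dataclass(frozen=True)
class BranchHint:
    btype: int  # 0: normal branch, 1: wish branch
    wtype: int  # 0: jump, 1: loop, 2: join
    p: int      # predicate register identifier


def decode_hint(bits: int) -> BranchHint:
    # Hypothetical layout: bit 8 = btype, bits 7..6 = wtype, bits 5..0 = p.
    return BranchHint(btype=(bits >> 8) & 0x1,
                      wtype=(bits >> 6) & 0x3,
                      p=bits & 0x3F)


def classify(bits: int, supports_wish: bool) -> str:
    hint = decode_hint(bits)
    name = WTYPE_NAMES.get(hint.wtype)
    # Hardware without wish branch support ignores the hint bits.
    # Treating an unknown wtype as a normal branch is also an assumption.
    if not supports_wish or hint.btype == 0 or name is None:
        return "normal branch"
    return "wish " + name


encoded = (1 << 8) | (2 << 6) | 5   # wish join guarded by predicate 5
print(classify(encoded, supports_wish=True))
print(classify(encoded, supports_wish=False))
```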
Compiler Support
Major phase ordering with wish branch generation in code generation [ORC]
    Phases: region formation, if-conversion, loop opt (swp, unrolling), global inst. sched, register allocation, local inst. sched
    Profile input: edge/value profiling
    If-conversion steps: select candidates, cost-benefit analysis, predicate selected blocks, branch elimination
    Wish branch steps: wish jump conversion and wish join insertion (shown with if-conversion), wish loop conversion (shown with loop opt)
    Legend: modified, new, existing
Wish Branch Generation Algorithm
Wish jump/join candidates: all branches that are suitable for if-conversion
    If the number of instructions in the fall-through block > N (N=5): wish jump and join are inserted
    All other branches are converted to predicated code
A loop branch is converted into a wish loop when the loop body has fewer than L instructions (L=30)
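
The selection rules above can be written as two small functions. This is a sketch of the heuristic as stated on the slide, with hypothetical function names. Two points are assumptions: branches that are not suitable for if-conversion stay as normal branches, and loop branches with L or more instructions in the body stay as normal backward branches.

```python
N_FALLTHROUGH = 5   # N from the slide
L_LOOP_BODY = 30    # L from the slide


def choose_forward_branch(if_convertible: bool, fallthrough_insts: int) -> str:
    """Pick the code form for a forward branch."""
    if not if_convertible:
        return "normal branch"      # assumption: not a candidate
    if fallthrough_insts > N_FALLTHROUGH:
        return "wish jump/join"
    return "predicated code"


def choose_loop_branch(loop_body_insts: int) -> str:
    """Pick the code form for a backward loop branch."""
    if loop_body_insts < L_LOOP_BODY:
        return "wish loop"
    return "normal branch"          # assumption: large loops are left alone


print(choose_forward_branch(True, 8))    # wish jump/join
print(choose_forward_branch(True, 3))    # predicated code
print(choose_loop_branch(12))            # wish loop
```

Small fall-through blocks are simply predicated, presumably because the cost of fetching a few extra instructions is low compared with the cost of the extra wish branch instructions.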
Hardware Support
Instruction fetch/decode logic
    Decoder: decode wish branches
    BTB: mark wish branches
Wish branch state machine hardware
    Wish loop stays in low-confidence mode until the loop exits
Predicate dependency elimination module
    High-confidence mode: predicate values are predicted
Branch misprediction detection/recovery module
    No flush if wish branch is mispredicted during low-confidence mode
Confidence estimator
JRS Confidence Estimator
Assigning Confidence to Conditional Branch Predictions [Jacobsen et al. MICRO-29]
Estimate how much confidence the processor has in a branch prediction
Trained with branch misprediction information
[Diagram: the PC and the global BHR are combined (shown as + on the slide) into an m-bit index into a table of 2^m n-bit counters. The selected counter is compared with a threshold (> th?) to report High Confidence or Low Confidence.]
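
A JRS-style confidence estimator can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the configuration used in the lecture: the index is formed by XORing the PC with the global branch history, the counters are incremented on a correct prediction and reset on a misprediction, and the table is untagged. The class name and default parameter values are hypothetical.

```python
class JRSConfidenceEstimator:
    """Table of 2^m n-bit counters indexed by PC and global history."""

    def __init__(self, index_bits=10, counter_bits=4, threshold=8):
        self.mask = (1 << index_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)

    def _index(self, pc, bhr):
        # Assumption: PC and global BHR are combined with XOR.
        return (pc ^ bhr) & self.mask

    def is_high_confidence(self, pc, bhr):
        return self.table[self._index(pc, bhr)] > self.threshold

    def train(self, pc, bhr, prediction_correct):
        i = self._index(pc, bhr)
        if prediction_correct:
            # Saturating increment on a correct prediction.
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            # Reset on a misprediction.
            self.table[i] = 0


est = JRSConfidenceEstimator()
pc, bhr = 0x400, 0b1011
for _ in range(9):
    est.train(pc, bhr, prediction_correct=True)
print(est.is_high_confidence(pc, bhr))   # counter is 9, above threshold 8
est.train(pc, bhr, prediction_correct=False)
print(est.is_high_confidence(pc, bhr))   # counter was reset to 0
```

For wish branches, a high-confidence result would select normal branch behavior and a low-confidence result would select predicated execution.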
Experimental Infrastructure
IA-64 provides full support for predication
Convert IA-64 traces to micro-ops to simulate an out-of-order superscalar processor model
Tool flow: Source Code -> IA-64 Compiler (ORC) -> IA-64 Binary -> Trace generation module -> IA-64 Trace -> Micro-op Translator -> µops -> Micro-op Simulator
Simulation Methodology
Nine SPEC 2000 integer benchmarks
Baseline Processor Configuration
    Front End
        Large and accurate branch predictor (64KB hybrid branch predictor: gshare + local)
        Minimum 30-cycle branch misprediction penalty
        64KB, 2-cycle latency I-cache
    Execution Core
        8-wide out-of-order processor
        512-entry instruction window
    Confidence Estimator
        1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al. MICRO-29)
Performance Improvement
[Figure: execution time normalized to non-predicated code for gzip, vpr, mcf, crafty, parser, gap, vortex, bzip2, twolf, AVG, and AVGnomcf. Bars: SELECTIVE-PREDICATION, AGGRESSIVE-PREDICATION, wish jump/join, wish jump/join/loop. One off-scale bar is labeled 2.02. Other bar annotations: 24%, 8%, 14%, -4%.]
SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis
AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated
Callouts on the chart:
    12% over conditional branch prediction, 11% over selective-predication, 13% over aggressive-predication
    14% over conditional branch prediction, 13% over selective-predication, 16% over aggressive-predication
    16% over conditional branch prediction (w/o mcf), 11% over selective-predication (w/o mcf), 7% over aggressive-predication (w/o mcf)
Wish Branch: Conclusion
New control flow instructions: wish branches (jump/join/loop)
Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture
    Compiler: analyzes the control-flow graph and generates code
    Microarchitecture: makes run-time decision to use predication
Wish branches provide significant performance benefits
    16% compared to conditional branch prediction
    13% compared to selectively predicated code
Wish branches can make predicated execution more viable and effective in high performance processors
    By enabling adaptive and aggressive predicated execution
2D-profiling
Goal: Identify input-dependent branches by using a single input set for profiling
If We Know a Branch is Input-Dependent
    May not convert it to predicated code.
    May convert it to a wish branch.
    May not perform other compiler optimizations or may perform them less aggressively.
        Hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99]
Key Insight of 2D-profiling
Phase behavior in prediction accuracy is a good indicator of input dependence
[Figure: two plots of branch prediction accuracy (0 to 1) over time (in number of executed instructions x 100M, 0 to 1500), one for an input-dependent branch and one for an input-independent branch, with phase 1, phase 2, and phase 3 marked.]
Traditional Profiling
[Figure: prediction accuracy (pr.Acc) over time for brA and for brB, each with its MEAN pr.Acc marked.]
Only the mean is recorded: when MEAN pr.Acc(brA) and MEAN pr.Acc(brB) match, the behavior of brA and the behavior of brB look the same.
2D-profiling
[Figure: prediction accuracy (pr.Acc) over time for brA and for brB, each with MEAN pr.Acc and STD pr.Acc marked.]
MEAN pr.Acc(brA) and MEAN pr.Acc(brB) match, but STD pr.Acc(brA) ≠ STD pr.Acc(brB)
behavior of brA ≠ behavior of brB
A: input-dependent br, B: input-independent br
2D-profiling Mechanism
The profiler collects branch prediction accuracy information for every static branch over time
    slice size = M instructions
    Slice 1: mean Pr.Acc(brA,s1), mean Pr.Acc(brB,s1), ...
    Slice 2: mean Pr.Acc(brA,s2), mean Pr.Acc(brB,s2), ...
    ...
    Slice N: mean Pr.Acc(brA,sN), mean Pr.Acc(brB,sN), ...
Calculate MEAN (brA, brB, …), Standard deviation (brA, brB, …), PAM: Points Above Mean (brA, brB, …)
[Figure: per-slice prediction accuracy over time for brA and brB, with mean brA and mean brB marked. The two branches are annotated PAM: 50% and PAM: 0%.]
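
The per-branch statistics can be computed from the slice accuracies as sketched below. The function name and the sample accuracy values are hypothetical, and PAM is taken here to be the fraction of slices whose accuracy is strictly above the branch's overall mean, which is an assumption about the exact definition.

```python
from statistics import mean, pstdev


def two_d_profile(slice_accuracies):
    """Summarize one static branch from its per-slice prediction accuracy."""
    overall = mean(slice_accuracies)
    spread = pstdev(slice_accuracies)
    # PAM: share of slices strictly above the overall mean (assumed definition).
    above = sum(1 for acc in slice_accuracies if acc > overall)
    return {"mean": overall, "std": spread,
            "pam": above / len(slice_accuracies)}


# Hypothetical per-slice accuracies for two static branches.
br_a = [0.95, 0.55, 0.95, 0.55]   # accuracy changes from slice to slice
br_b = [0.75, 0.75, 0.75, 0.75]   # accuracy is steady

for name, slices in (("brA", br_a), ("brB", br_b)):
    print(name, two_d_profile(slices))
```

Both branches have about the same mean, so a profile that records only the mean cannot tell them apart, while the standard deviation and PAM differ.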
2D-profiling: Conclusion & Future Work
2D-profiling is a new profiling technique to find input-dependent characteristics by using a single input data set for profiling
2D-profiling uses time-varying information instead of just average data
    Phase behavior in prediction accuracy in a profile run indicates input dependence
Future Work
    Better predicated code/wish branch generation algorithms
    Detecting other input-dependent program characteristics