Diverge-Merge Processor (DMP)
DESCRIPTION
Diverge-Merge Processor (DMP). Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt. HPS Research Group, University of Texas at Austin; *Microsoft Research. Outline: Predicated Execution; Diverge-Merge Processor (DMP); Implementation of DMP; Experimental Evaluation.
TRANSCRIPT
Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt
HPS Research Group, University of Texas at Austin
*Microsoft Research
2
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
3
Predicated Execution
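Convert control flow dependence to data dependence

Source code:
    if (cond) { b = 0; } else { b = 1; }

(normal branch code)
    p1 = (cond)
    branch p1, TARGET
    mov b, 1
    jmp JOIN
    TARGET:
    mov b, 0

(predicated code)
    p1 = (cond)
    (!p1) mov b, 1
    (p1) mov b, 0

(Figure: control-flow graph in which block A branches to C when taken (T) and to B when not taken (N), and both paths rejoin at D; in the predicated version A, B, and C form one straight-line sequence.)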
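The conversion on this slide can be modeled in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the `if guard` test stands in for hardware that nullifies an instruction whose predicate is false.

```python
def branch_version(cond):
    """Normal branch code: only one of the two moves is executed."""
    if cond:            # branch p1, TARGET
        b = 0           # TARGET: mov b, 0
    else:
        b = 1           # mov b, 1 ; jmp JOIN
    return b


def predicated_version(cond):
    """Predicated code: both moves are issued, each guarded by a predicate."""
    b = None
    p1 = bool(cond)                     # p1 = (cond)
    # Each entry is (guard predicate, value moved into b).
    instructions = [(not p1, 1),        # (!p1) mov b, 1
                    (p1, 0)]            # (p1)  mov b, 0
    for guard, value in instructions:
        if guard:                       # a false guard makes the move a no-op
            b = value
    return b
```

Both versions compute the same value of `b` for either outcome of `cond`; the predicated one has no control-flow decision left for a branch predictor to get wrong.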
4
Benefit of Predicated Execution
Predicated execution can be high-performance and energy-efficient.
(Figure: animation of instructions A through F moving through the pipeline stages Fetch, Decode, Rename, Schedule, Register Read, and Execute. Labels: Predicated Execution (nop), Branch Prediction (Pipeline flush!!).)
5
Limitations/Problems of Predication
ISA: Predicate registers and predicated instructions
    Dynamic-hammock predication [Klauser’98] can solve this problem, but it is only applicable to simple hammocks.
Adaptivity: Static predication is not adaptive to run-time branch behavior.
    Branch behavior changes based on input set, phase, and control-flow path.
    Wish branches [Kim’05]
Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
    Function calls, loops, many instructions inside a region, and complex CFGs
    Hyperblock [Mahlke’92] cannot adapt to frequently-executed paths dynamically.
6
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
7
Diverge-Merge Processor (DMP)
DMP can dynamically predicate complex branches (in addition to simple hammocks).
The compiler identifies
    Diverge branches
    Control-flow merge (CFM) points
The microarchitecture decides when and what to predicate dynamically.
8
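Dynamic Predication
Klauser et al. [PACT’98]: Dynamic-hammock predication

Original code:
    p1 = (cond)
    branch p1, TARGET        (low-confidence)
    mov R1, 1
    jmp JOIN
    TARGET:
    mov R1, 0
    JOIN:
    add R5, R1, 1

Dynamically predicated:
    (mov R1, 1)  PR10 = 1
    (mov R1, 0)  PR11 = 0
    PR12 = (cond) ? PR11 : PR10        select-µop (φ-node in SSA)

(Figure: control-flow graph in which block A branches to C when taken (T) and to B when not taken (N), and both paths merge at H, which holds the JOIN instruction.)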
9
Diverge-Merge Processor
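(Figure: control-flow graph with blocks A through H. The diverge branch is in A and the CFM point is H. Legend: frequently executed path, not frequently executed path. Blocks A, C, E, B, and H are highlighted, and select-µops are inserted at the CFM point.)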
10
Diverge-Merge Processor
(Figure: animation over the same control-flow graph, blocks A through H. Legend: diverge branch, executed block, CFM point; frequently executed path, not frequently executed path.)
11
Control-Flow Graphs
CFG types: simple hammock, nested hammock, frequently-hammock, loop, non-merging
Mechanisms: DMP, dynamic-hammock, software predication (SW pred), wish branches (Wish br.), dual-path
(Table: which mechanism can handle which CFG type.)
12
Dual-path Execution vs. DMP
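(Figure: control-flow graph in which a low-confidence branch in A leads to B or C, followed by D, E, and F. Dual-path: path 1 fetches C, D, E, F and path 2 fetches B, D, E, F. DMP: path 1 fetches C and path 2 fetches B up to the CFM point, after which D, E, and F follow.)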
13
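Control-Flow Graphs
CFG types: simple hammock, nested hammock, frequently-hammock, loop, non-merging
Mechanisms: DMP, dynamic-hammock, software predication (SW pred), wish branches (Wish br.), dual-path
(Table: which mechanism can handle which CFG type; two entries are marked "sometimes".)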
14
Distribution of Mispredicted Branches
66% of mispredicted branches can be dynamically predicated in DMP.
(Figure: stacked bar chart of mispredictions per kilo instructions (MPKI), axis from 0 to 12, for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, and amean. Each bar is broken down by CFG type: simple, nested, frequently, loop, non-merging.)
16
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
17
Fetch Mechanism
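(Figure: the control-flow graph with blocks A through H, with the diverge branch in A and the CFM point at H. The diverge branch is low confidence; the predicted path is marked, and blocks A, C, E, B, and H are highlighted.)
Round-robin fetch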
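Dynamic Predication

Original code (blocks A, C, B, H):
    A:  branch r0, C
    C:  add r1 <- r3, #1
    B:  add r1 <- r2, #-1
    H:  add r4 <- r1, r3

Renamed code:
    A:  branch pr10, C                  p1 = pr10
    C:  add pr21 <- pr13, #1            (p1)
    B:  add pr31 <- pr12, #-1           (!p1)
        select-µop pr41 = p1 ? pr21 : pr31
    H:  add pr24 <- pr41, pr13

Forks RAT, RAS, and GHR

(Figure: two register alias tables, RAT1 and RAT2, each with columns Arch., Phy., and M. Both start with R1 -> PR11, R2 -> PR12, R3 -> PR13. One path renames R1 to PR21 and the other renames R1 to PR31, each with M = 1. After the select-µop, R1 maps to PR41.)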
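The select-µop step on this slide can be sketched as a comparison of the two forked RATs at the CFM point. The function below is a simplified illustration, not the hardware algorithm: it compares the mappings directly instead of using the M bits, and `merge_rats` and `alloc_phys_reg` are hypothetical names.

```python
def merge_rats(rat_taken, rat_not_taken, predicate, alloc_phys_reg):
    """Generate select-µops for registers that the two paths renamed differently.

    rat_taken / rat_not_taken: dict from architectural to physical register.
    predicate: name of the predicate that is true on the taken path.
    alloc_phys_reg: callable returning a fresh physical register name.
    Returns (select_uops, merged_rat); each select-µop is a tuple
    (dest, predicate, value_if_true, value_if_false).
    """
    select_uops = []
    merged_rat = {}
    for arch_reg in sorted(rat_taken):
        taken_phys = rat_taken[arch_reg]
        not_taken_phys = rat_not_taken[arch_reg]
        if taken_phys == not_taken_phys:
            # Neither path wrote this register: keep the common mapping.
            merged_rat[arch_reg] = taken_phys
        else:
            # The paths disagree: dest = predicate ? taken_phys : not_taken_phys
            dest = alloc_phys_reg()
            select_uops.append((dest, predicate, taken_phys, not_taken_phys))
            merged_rat[arch_reg] = dest
    return select_uops, merged_rat


# Values taken from the slide: R1 is PR21 on one path and PR31 on the other.
uops, rat = merge_rats(
    {"R1": "PR21", "R2": "PR12", "R3": "PR13"},
    {"R1": "PR31", "R2": "PR12", "R3": "PR13"},
    "p1",
    lambda: "PR41",
)
```

With the slide's values this yields the single select-µop `pr41 = p1 ? pr21 : pr31`, and code after the CFM point reads R1 through PR41.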
19
DMP Support
ISA Support
    Mark diverge branches/CFM points.
Compiler Support [CGO’07]
    The compiler identifies diverge branches and the corresponding CFM points.
Hardware Support
    Confidence estimator
    Fetch mechanisms
    Load/store processing
    Instruction retirement
    Dynamic predication
20
Hardware Complexity Analysis
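(Table comparing the hardware support each mechanism needs.
Hardware components: ST-LD forwarding, select-µop generation, rename support, front-end, check flush/no flush, predicate registers, confidence estimator.
Mechanisms: SW pred., wish br., dyn. ham., dual-path, multipath, DMP.)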
21
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
22
Simulation Methodology
12 SPEC 2000 INT, 5 SPEC 95 INT
    Different input sets for profiling and evaluation
Alpha ISA execution-driven simulator
Baseline processor configuration
    64KB perceptron predictor/O-GEHL (paper)
    Minimum 30-cycle branch misprediction penalty
    8-wide, 512-entry instruction window
    2 KB 12-bit history enhanced JRS confidence estimator
Less aggressive processor (paper)
Power model using Wattch
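For readers unfamiliar with JRS-style confidence estimation, the following is a minimal sketch of the basic idea: a table of resetting counters indexed by the branch address XORed with global history. The 4-bit counter width and the threshold of 15 are assumptions chosen so that 4096 entries occupy 2 KB; the "enhanced" variant and the exact parameters used in the evaluation are not described on the slide.

```python
class JRSConfidenceEstimator:
    """Table of resetting counters: a high count means high confidence."""

    def __init__(self, index_bits=12, counter_bits=4, threshold=15):
        self.mask = (1 << index_bits) - 1           # 12 bits -> 4096 entries
        self.max_count = (1 << counter_bits) - 1    # saturate at 15 for 4-bit counters
        self.threshold = threshold                  # assumed confidence threshold
        self.table = [0] * (1 << index_bits)

    def _index(self, pc, history):
        # Hash the branch address with the global branch history.
        return (pc ^ history) & self.mask

    def is_high_confidence(self, pc, history):
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            # Count consecutive correct predictions, saturating at the maximum.
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            # A misprediction resets the counter to zero.
            self.table[i] = 0
```

In DMP, a diverge branch that such an estimator flags as low confidence is the trigger for entering dynamic predication.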
23
Different CFG types
(Figure: bar chart of IPC improvement (%), axis from 0 to 60, for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, and hmean. Series: simple; simple, nested; simple, nested, frequently; simple, nested, frequently, loop.)
24
Performance Improvement
(Figure: bar chart of performance improvement (%), axis from 0 to 25, for DMP, dynamic-hammock, dual-path, multipath, limited software predication, and wish branches.)
25
Energy Consumption
(Figure: bar chart of energy reduction (%), axis from -5 to 10, for DMP, dynamic-hammock, dual-path, multipath, limited software predication, and wish branches.)
26
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
27
Conclusion
DMP introduces the concept of frequently-hammocks and it dynamically predicates complex CFGs.
DMP can overcome the three major limitations of software predication: ISA support, adaptivity, complex CFG.
DMP reduces branch mispredictions energy-efficiently.
    19% performance improvement, 9% less energy
DMP divides the work between the compiler and the microarchitecture:
    The compiler analyzes the control-flow graphs.
    The microarchitecture decides when and what to predicate dynamically.
Thank You!!
Questions?