Bungee Jumps: Accelerating Indirect Branches Through Hardware/Software Co-Design
Daniel S. McFarlin
Craig Zilles
Indirect Branches Are Increasingly Predictable And Unbiased
[Figure: indirect branch mispredicts per kilo-instruction (y-axis 0 to 20) for benchmarks 0 to 28 and their average. Series: Nehalem, Sandy Bridge, Haswell, TAGE. Annotation: "Better".]
[Figure: Branch Bias/Predictability (%, y-axis 0 to 100) per benchmark. Series: Indirect Branch Predictability, Weighted Average Indirect Branch Bias. Data labels, in the order extracted and rounded to two decimals:]
Benchmark     Predictability (%)   Weighted Average Bias (%)
btree         94.86                71.07
fannkuch      91.69                64.80
fasta         99.85                81.40
float         97.32                73.80
knuke         99.26                87.48
mandelbrot    92.65                68.12
meteor        99.05                75.93
nbody         91.79                70.57
queens        98.39                70.73
raytrace      91.21                73.28
regexdna      88.68                71.82
revcomp       99.97                90.15
richards      92.10                71.51
specnorm      97.18                63.90
Geomean       95.21                73.55
[Figure: predictability and bias on a 0 to 1 scale for meteor, raytrace, btree, fannkuch, fasta, richards, nqueens, revcomp, float, specnorm, regexdna, knuke, mandelbrot, Geomean.]
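The slides do not define "weighted average indirect branch bias". A minimal sketch, assuming bias means the share of a branch's executions that go to its most frequent target, weighted across branches by execution count (the function name and trace format are hypothetical):

```python
from collections import Counter, defaultdict

def weighted_average_bias(trace):
    """trace: iterable of (branch_pc, target) pairs, one per dynamic execution."""
    per_branch = defaultdict(Counter)
    for pc, target in trace:
        per_branch[pc][target] += 1
    total = sum(sum(counts.values()) for counts in per_branch.values())
    # Summing each branch's dominant-target count and dividing by the total
    # execution count weights every branch's bias by how often it executes.
    dominant = sum(max(counts.values()) for counts in per_branch.values())
    return 100.0 * dominant / total

# Hypothetical trace: branch 0x10 splits 6/4 between two targets,
# branch 0x20 always goes to the same target.
trace = [(0x10, "A")] * 6 + [(0x10, "B")] * 4 + [(0x20, "C")] * 10
print(weighted_average_bias(trace))  # 80.0
```

A branch can be highly predictable yet have low bias under this definition, for example when its target alternates in a regular pattern that a history-based predictor captures.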
In-Order Machines Specialize Based on Branch Bias or Eliminate Branch Prediction Altogether
[Figure: a Shape object whose VTable points to three area() implementations, labeled 60, 30, and 10.]
RCPO applied to the Shape example:
    if s->type == Circle        area()
    else if s->type == Rect     area()
    else if s->type == Square   area()
    else                        area() (via VTable)
Original call:
    r = obj->func( );
RCPO:
    if (obj is type B) r = B::func( );
    else if (obj is type C) r = C::func( );
    else if (obj is type D) r = D::func( );
    else r = obj->func( );
Predication:
    p0 = (obj is type B);
    p1 = (obj is type C);
    p2 = (obj is type D);
    p0: r = B::func( );
    p1: r = C::func( );
    p2: r = D::func( );
    if( !(p0 | p1 | p2)) r = obj->func( );
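The RCPO pattern above can be sketched as runnable code. This is an illustration only: the classes, fields, and the mapping of the 60/30/10 weights to Circle/Rect/Square are assumptions, and Triangle is a hypothetical type added to exercise the fallback.

```python
import math

class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return math.pi * self.r ** 2

class Rect(Shape):
    def __init__(self, w, h):
        self.w, self.h = w, h
    def area(self):
        return self.w * self.h

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side ** 2

class Triangle(Shape):  # hypothetical rare receiver type
    def __init__(self, base, height):
        self.base, self.height = base, height
    def area(self):
        return 0.5 * self.base * self.height

def area_rcpo(s):
    """Test the receiver type in profile order; inline the body for each hit."""
    t = type(s)
    if t is Circle:            # assumed most frequent receiver
        return math.pi * s.r ** 2
    elif t is Rect:
        return s.w * s.h
    elif t is Square:          # assumed least frequent profiled receiver
        return s.side ** 2
    else:
        return s.area()        # fallback: ordinary dynamic dispatch

print(area_rcpo(Rect(2, 3)), area_rcpo(Triangle(4, 3)))
```

Each type test is a direct conditional branch, so the scheme pays off only when a few receiver types dominate; with many comparably likely targets the chain grows and the later tests are rarely useful.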
Challenge: Non-Reconvergence & Large Number of Targets
[Figure: control-flow graph from sjeng: f_in_check. Block A ends in an indirect jump to blocks B through G; each target block ends in its own conditional branch. Edge weight labels in the figure (%): 25, 24, 24, 14, 7, 11, 99/1, 19, 5/95, 96/4, 99/1, 1/99.]
A:  ld r9, [rax*8+0x94]
    jmp r9
B:  ld r8, [rip+0x5b]
    ld edi, [rsi*4+0x92]
    movsxd rcx, r8
    ld r10, [rcx*4+0x92]
    cmp edi, r10
    jnz H
C:  ld edx, [rsi*4+0x7d]
    cmp edx, 0x6
    jz I
D:  ld ecx, [rip+0x58]
    ld esi, [rip+0x81]
    lea edi, [rsi+rcx*1]
    cmp edi, edx
    jz J
E:  ld edx, [rcx*8+0x10]
    test edx, edx
    jz K
F:  ld r8, [rip+0x39]
    ld edi, [rsi*4+0x46]
    movsxd rcx, r8
    ld r10, [rcx*4+0x46]
    cmp edi, r10
    jnz M
G:  ld r8, [rip+0x5e]
    ld edi, [rsi*4+0x42]
    movsxd rcx, r8
    ld r10, [rcx*4+0x42]
    cmp edi, r10
    jnz L
Missed Optimization Opportunity: Next Branch Bias
[Figure: three bar charts of Branch Bias (%, y-axis 80 to 100).]
Chart 1 benchmarks: gcc95, li, m88ksim, perl95, eon, gap, gcc2k, perl2k, gcc2k6, h264ref, omnetpp, perl2k6, povray, sjeng, xalanc, richards, Geomean
Chart 2 benchmarks: bench, btree, fannkuch, fasta, fastaredux, knuke, mandelbrot, nbody, regexdna, revcomp, run, specnorm, Geomean
Chart 3 benchmarks: btree, fannkuch, fasta, float, knuke, mandelbrot, meteor, nbody, queens, raytrace, regexdna, revcomp, richards, specnorm, Geomean
Exploiting Predictability: Benefit from Next Branch Bias
[Figure: the f_in_check control-flow graph after transformation. Block A's indirect jump becomes pred-indirect-jump, each target block gains an "assert r9, <block>" instruction, and "Hoist From" annotations mark the target blocks. Edge weight labels in the figure (%): 25, 24, 24, 14, 11, 99/1, 19, 5/95, 96/4, 99/1, 1/99.]
A:  ld r9, [rax*8+0x94]
    pred-indirect-jump
B:  ld r8, [rip+0x5b]
    ld edi, [rsi*4+0x92]
    movsxd rcx, r8
    ld r10, [rcx*4+0x92]
    cmp edi, r10
    assert r9, B
    jnz H
C:  ld edx, [rsi*4+0x7d]
    assert r9, C
    cmp edx, 0x6
    jz I
D:  ld ecx, [rip+0x58]
    ld esi, [rip+0x81]
    lea edi, [rsi+rcx*1]
    assert r9, D
    cmp edi, edx
    jz J
E:  ld edx, [rcx*8+0x10]
    assert r9, E
    test edx, edx
    jz K
F:  ld r8, [rip+0x39]
    ld edi, [rsi*4+0x46]
    movsxd rcx, r8
    assert r9, F
    ld r10, [rcx*4+0x46]
    cmp edi, r10
    jnz M
G:  ld r8, [rip+0x5e]
    ld edi, [rsi*4+0x42]
    movsxd rcx, r8
    assert r9, G
    ld r10, [rcx*4+0x42]
    cmp edi, r10
    jnz L
Challenge: Invalid Predicted Target Address
Original code (the indirect jump is the prediction point):
    ...
    0: r0 = load[a]
    1: r2 = r0 + r1
    2: jmp [r2]
Valid Targets: {A, G}; Predict A (block A, B, C, D)
Transformed code:
    ...
    0: r0 = load[a]
    predict
    ...
Predicted target:
    A
    B
    C
    D
    1: r2 = r0 + r1
    2*: r3 = load[r2]
    2: resolve r3, A
Problem case:
    ...
    0: r0 = load[a]
    predict
    ...
Invalid Target: M (block M, N, O, P)
Solution: Landing Pad
Transformed code (0: Prediction Point):
    ...
    0: r0 = load[a]
    predict 0x0
    ...
    r2 = r0 + r1
    jmp [r2]
Valid predicted target (landing pad):
    marker 0x0
    A
    B
    C
    D
    1: r2 = r0 + r1
    2*: r3 = load[r2]
    2: resolve r3, A
Invalid Target: M (block M, N, O, P). Fails 0x0 Check, so Redirect Fetch to:
    r2 = r0 + r1
    jmp [r2]
No Predict: Stall Fetch
Necessitates some changes to indirect call/return handling
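The landing-pad check can be sketched as a toy fetch-steering function. This is an interpretation of the slide, not the authors' hardware: the addresses, the `CODE` map, the tuple encoding of instructions, and the returned action names are all hypothetical.

```python
# Toy code memory: address -> instruction found at that address.
CODE = {
    0x100: ("marker", 0x0),  # landing pad emitted for `predict 0x0`
    0x200: ("other", None),  # block M: ordinary code with no marker
}

def steer_fetch(predicted_target, marker_id, fallthrough, code=CODE):
    """Decide where fetch goes after a `predict marker_id` instruction."""
    if predicted_target is None:
        return ("stall", None)              # no prediction: stall fetch
    if code.get(predicted_target) == ("marker", marker_id):
        return ("fetch", predicted_target)  # valid landing pad
    # Marker check failed: redirect fetch to the code after the predict,
    # which still holds the original `jmp [r2]`.
    return ("redirect", fallthrough)

print(steer_fetch(0x100, 0x0, fallthrough=0x40))  # ('fetch', 256)
print(steer_fetch(0x200, 0x0, fallthrough=0x40))  # ('redirect', 64)
```

The marker ID ties a landing pad to one predict instruction, so a target that is a landing pad for a different predict also fails the check in this sketch.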
Recovery From Misprediction
Speculative path at predicted target A:
    A
    B
    C
    D
    1: r2 = r0 + r1
    2*: r3 = load[r2]
    2: resolve r3, A, RC
Recovery code:
    RC: undo A
    undo B
    undo C
    undo D
    ...
    jmp r3
Correct path (r3):
    E
    F
Steps:
(1) resolve fails
(2) Resteer RC addr to Front End
(3) Resteer r3 to Front End
(4) Fetch Recovery Code
(5) r3 value from Resteer
(6) Fetch Correct Path starting at r3
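The recovery sequence can be sketched in software as an undo log. This is a loose model under stated assumptions: each hoisted instruction is reduced to a single register write, the undo log stands in for the compiler-generated recovery code, and the function and argument names are hypothetical.

```python
def run_with_resolve(regs, predicted, actual, hoisted):
    """Run the predicted block speculatively, then resolve against `actual`.

    hoisted[t] lists the (register, value) writes performed by target t.
    Returns the sequence of code regions fetched.
    """
    undo_log = []                       # (register, old value), in write order
    path = [predicted]
    for reg, value in hoisted[predicted]:
        undo_log.append((reg, regs[reg]))
        regs[reg] = value
    if predicted == actual:             # resolve succeeds: keep the work
        return path
    path.append("RC")                   # (1)-(4): resolve fails, fetch recovery code
    for reg, old in reversed(undo_log): # undo the speculative writes
        regs[reg] = old
    path.append(actual)                 # (5)-(6): jmp r3, fetch the correct path
    for reg, value in hoisted[actual]:
        regs[reg] = value
    return path

regs = {"r4": 1, "r5": 2}
hoisted = {"A": [("r4", 10), ("r5", 20)], "E": [("r4", 7)]}
print(run_with_resolve(regs, "A", "E", hoisted), regs)
```

In this sketch a misprediction costs the wasted speculative work plus the undo pass, which is why the transformation is worthwhile only for branches that are predicted well.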
Hardware Requirements
[Figure: block diagram of the existing BPU (History and PC feed a HASH that produces Indices into the Predictor Table through its read and write ports) extended with a DBB and a DBB Tail Pointer (old and new positions). Outputs go to the Fetch Unit, and the Fetch Buffer feeds the Back-End. Step shown: predict instruction: Insert New DBB Entry (Indices and Prediction State).]
• Maintain correspondence between prediction/resolution
• Small FIFO: not many branches outstanding and no reordering of prediction and resolution instructions
  – Dovetails with existing structures for outstanding branches
• Single pointer, single R/W port
Hardware Requirements
[Figure: the same BPU and DBB block diagram. Step shown: Insert DBB Index Into Branch Resolution Instruction. The index at the DBB Tail Pointer travels with the instruction through the Fetch Buffer to the Back-End.]
Hardware Requirements
[Figure: the same BPU and DBB block diagram. Step shown: the DBB Index from the Branch Resolution Instruction arrives From the Back-End, the DBB supplies the saved Indices, the New Prediction State is written to the Predictor Table, and Resteer Logic signals the Fetch Unit.]
Experimental Methodology
14
• Profile Guided Optimization LLVM 3.5
• Benchmarks with 0.3% dynamic instruction stream indirect branches
• SPEC TRAIN input, PHP and Python first input
• Cycle Accurate x86 simulator PTLSim provides predictability
• Transform non-loop branches with pred > bias and (pred - bias) > 3%
• Run SPEC, PHP, Python on PTLSim using PGO binaries with and without prediction guide on REF
• Speculation support: (alias/shadow registers) for baseline and experimental
• Improved Indirect Branch Predictor: VPC (TAGE study in paper)
second approach would be to mark the DBB entries as invalid on one of these exceptional control flow events and suppress any
branch predictor updates using an invalid DBB entry.
We size the DBB empirically. In practice, the number of outstanding decomposed branches is small; there tends to be
significant back-pressure in the in-order pipeline due to head-of-line blocking in the issue stage. We found that 16 entries were more than
sufficient, resulting in a 4-bit DBB index that needs to be stored with resolution instructions. In our implementation, each DBB
entry contains 24 bits: 16 bits for the indices into the branch prediction table hierarchy and 8 bits for the prediction metadata.
The DBB itself requires one read and one write port.
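The sizing above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the bit layout inside the 24-bit entry (indices in the high bits) is an assumption, and wrap-around allocation at the tail pointer is assumed rather than taken from the text. The entry count, index width, field widths, and the "mark invalid and suppress the update" behavior follow the description.

```python
from dataclasses import dataclass
from typing import Optional

DBB_ENTRIES = 16        # from the text: 16 entries were more than sufficient
DBB_INDEX_BITS = 4      # log2(16): index carried by each resolution instruction
TABLE_INDEX_BITS = 16   # per entry: indices into the prediction table hierarchy
METADATA_BITS = 8       # per entry: prediction metadata


@dataclass
class DBBEntry:
    table_indices: int  # 16-bit field
    metadata: int       # 8-bit field
    valid: bool = True

    def pack(self) -> int:
        # Assumed layout: indices in the high 16 bits, metadata in the low 8.
        return (self.table_indices << METADATA_BITS) | self.metadata


class DecoupledBranchBuffer:
    """Hypothetical FIFO model with a single tail pointer."""

    def __init__(self) -> None:
        self.entries = [None] * DBB_ENTRIES
        self.tail = 0

    def allocate(self, table_indices: int, metadata: int) -> int:
        """Record prediction state; return the index to attach to the resolution instruction."""
        assert 0 <= table_indices < (1 << TABLE_INDEX_BITS)
        assert 0 <= metadata < (1 << METADATA_BITS)
        index = self.tail
        self.entries[index] = DBBEntry(table_indices, metadata)
        self.tail = (self.tail + 1) % DBB_ENTRIES  # assumed wrap-around
        return index

    def resolve(self, index: int) -> Optional[DBBEntry]:
        """Return the entry for a predictor update, or None if the update is suppressed."""
        entry = self.entries[index]
        if entry is None or not entry.valid:
            return None
        return entry

    def invalidate_all(self) -> None:
        """Mark every entry invalid, e.g. on an exceptional control flow event."""
        for entry in self.entries:
            if entry is not None:
                entry.valid = False


def storage_bits() -> int:
    """Payload storage implied by the sizing (valid bits excluded)."""
    return DBB_ENTRIES * (TABLE_INDEX_BITS + METADATA_BITS)
```

With these numbers the payload storage is 16 × 24 = 384 bits, or 48 bytes, not counting any valid bits.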
7. Experimental Evaluation
In our evaluation, we use LLVM 3.5, PTLSim [44], and the SPEC 2006 and 2000 benchmark suites. As it is not profitable to
decompose all branches, we use a profile-guided strategy to select the set of branches to transform. We run the TRAIN input
sets to completion in PTLSim to collect branch bias and predictability. We transform forward branches¹ whose predictability
exceeds bias by at least 5%; this heuristic provided the best overall performance. Code generation is handled by LLVM at the
-O3 level with PGO and SSE2 vectorization.
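The selection heuristic can be sketched as a simple filter over profile records. The record type, field names, and function name below are hypothetical, and the sketch assumes that "at least 5%" means five percentage points and that a forward branch is one whose target address is higher than its own address.

```python
from dataclasses import dataclass

MIN_GAP = 5.0  # percentage points; threshold taken from the text


@dataclass(frozen=True)
class BranchProfile:
    pc: int                # address of the branch (hypothetical field)
    target: int            # address of the taken target (hypothetical field)
    bias: float            # percent of executions in the dominant direction
    predictability: float  # percent of executions predicted correctly

    @property
    def is_forward(self) -> bool:
        # Assumption: forward means the target lies after the branch.
        return self.target > self.pc


def select_for_decomposition(profiles, min_gap=MIN_GAP):
    """Keep forward branches whose predictability exceeds bias by at least min_gap."""
    return [
        p for p in profiles
        if p.is_forward and (p.predictability - p.bias) >= min_gap
    ]
```

A backward (loop) branch is rejected regardless of its gap, and a forward branch with a 3-point gap is rejected because the predictor adds too little over a static bias-based layout.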
For performance evaluation, we use REF input sets. We select hot spots for simulation using Intel’s Software Development
Emulator [5]; each simulation runs for up to 10B dynamic instructions. We simulate in-order superscalars of three different widths (2-wide,
4-wide, and 8-wide, with configuration parameters shown in Table 1). For our experimental system, PTLSim was extended with
a decoupled branch buffer (DBB, as described in Section 6) as well as the support for speculative compilation discussed in
Section 4.2, including “shadow registers”, which allow the compiler to re-use architectural registers for speculative computations
and commit them when resolution instructions commit.
Key Structures    Configuration Parameters
Bpred             PTLSim default: GShare, 24 KB 3-table direction predictor, 4K-entry BTB, 64-entry RAS
Front-End         5 stages, Experimentally Varied 2/4/8 wide Fetch/Decode/Dispatch, 32-entry FetchBuffer
Execution Ports   Experimentally Varied 2/4/8
Functional Units  Up to 2 x LD/ST, 2 x INT/SIMD-Permute, 4 x 64-bit SIMD/FP, 1-cycle bypass
L1 Caches         8-way 32 KB L1-D$, 4-way 32 KB L1-I$, 64B lines, 4-cycle latency
L2 Cache          16-way 256KB Unified, 12-cycle latency
L3 Cache          32-way 4MB LLC, 25-cycle latency
Miss Handling     64-entry Miss Buffer, 64-entry Load Fill Request Queue
Main Memory       140-cycle latency
Table 1: Machine Configuration Parameters
¹ Backward-taken branches, i.e., loop branches, tend to be highly biased and highly predictable and are ably handled by well-known loop transformations such as modulo scheduling.
Performance: SPEC95, SPEC2K, SPEC2K6
• Performance proportional to % of dynamic instructions which are indirect branches (PDS) and amenable to transformation (weighted average bias: WAB)
• richards: 4% PDS, 41% WAB
• m88ksim: 0.4% PDS, 57% WAB
• Geomean: 1.1% PDS, 62% WAB
15
[Figure: bar chart, y-axis 0% to 30%, series 2-wide and 4-wide. Benchmarks: gcc95, li, m88ksim, perl95, eon, gap, gcc2k, perl2k, gcc2k6, h264ref, omnetpp, perl2k6, povray, sjeng, xalanc, richards, Geomean]
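The two metrics quoted on this slide can be made concrete with a short sketch. The slides do not spell out how the weighting is done, so the following is an assumption: each indirect branch's bias is taken as the share of its executions that go to its most frequent target, and branches are weighted by execution count. The function names and the profile layout (branch address mapped to per-target counts) are hypothetical.

```python
def branch_bias(target_counts):
    """Share of executions that go to the most frequent target (assumed definition)."""
    return max(target_counts.values()) / sum(target_counts.values())


def pds(profile, dynamic_instructions):
    """Percent of the dynamic instruction stream that is indirect branches."""
    executed = sum(sum(targets.values()) for targets in profile.values())
    return 100.0 * executed / dynamic_instructions


def wab(profile):
    """Weighted average bias in percent, weighting each branch by execution count."""
    executed = sum(sum(targets.values()) for targets in profile.values())
    # bias * count for one branch equals the count of its most frequent target
    dominant = sum(max(targets.values()) for targets in profile.values())
    return 100.0 * dominant / executed
```

For example, a profile with one branch that hits its main target 90 times out of 100 and another that splits 50/50 over 100 executions gives a WAB of 70% under this definition.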
Performance: PHP
• specnorm: 1.4% PDS, 18% WAB
• run: 0.6% PDS, 49% WAB
• Geomean: 1.4% PDS, 37% WAB
16
[Figure: bar chart, y-axis 0% to 50%, series 2-wide and 4-wide. Benchmarks: bench, btree, fannkuch, fasta, fastaredux, knuke, mandelbrot, nbody, regexdna, revcomp, run, specnorm, Geomean]
Performance: Python
• regexdna: 2.7% PDS, 72% WAB
• revcomp: 1.0% PDS, 90% WAB
• Geomean: 1.8% PDS, 73% WAB
17
[Figure: bar chart of Speedup, y-axis 0% to 20%, series 2-wide and 4-wide. Benchmarks: btree, fannkuch, fasta, float, knuke, mandelbrot, meteor, nbody, queens, raytrace, regexdna, revcomp, richards, specnorm, Geomean]
Efficiency
18
[Figure: three bar charts of Extra Instrs Issued, y-axis 0% to 4%, with a "Better" direction arrow.
Panel SPEC: gcc95, li, m88ksim, perl95, eon, gap, gcc2k, perl2k, gcc2k6, h264ref, omnetpp, perl2k6, povray, sjeng, xalanc, richards, Geomean.
Panel Python: btree, fannkuch, fasta, float, knuke, mandelbrot, meteor, nbody, queens, raytrace, regexdna, revcomp, richards, specnorm, Geomean.
Panel PHP: bench, btree, fannkuch, fasta, fastaredux, knuke, mandelbrot, nbody, regexdna, revcomp, run, specnorm, Geomean]
Conclusions
• Straightforward, low-cost “enabling” transformation
– Leverages DBT profiling and speculation facilities
• Modest Hardware Requirements
• Leverages Advances In Indirect Branch Prediction
• Good Performance across Integer and Floating Point
• Maintains the Efficiency of the In-Order
19
Thank You
20
Questions?