Post on 21-Dec-2015
The Power of Belady’s Algorithm in Register Allocation for Long Basic Blocks
Jia Guo, María Jesús Garzarán and David Padua
{jiaguo, garzaran, padua}@uiuc.edu
University of Illinois at Urbana-Champaign
LCPC2003 2
Motivation
Long basic blocks
  Compiler optimizations
  Library generators: SPIRAL, ATLAS
[Figure: Speedup obtained from unrolling FFTs. x-axis: FFT16, FFT32, FFT64; y-axis: speedup (0 to 3); series: SPARC, MIPS]
The effectiveness of register allocation on long basic blocks
Contributions
Apply Belady’s MIN algorithm to long basic blocks
Compared with MIPSPro, Belady’s MIN algorithm performs
  10% faster on Matrix Multiplication code
  12% faster on FFT code of size 32
  33% faster on FFT code of size 64
Belady’s MIN Algorithm
Chooses the farthest next use
Guarantees the minimum number of reloads, but not the stores.

Example with 3 FP registers:
1 c = a + b
2 e = a + d
3 f = a - b
4 h = d + c

load R1, a
load R2, b
add R3, R1, R2
store R3, c
load R3, d
… …

Reg       R1     R2     R3
Var       a      b      c
Next use  2      3      4
Status    clean  clean  dirty

Spill: c has the farthest next use, so R3 is stored and reloaded with d (next use 4, status clean).
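The farthest-next-use rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' compiler: the tuple format `(dest, op, src1, src2)`, the function names `next_use` and `allocate_min`, and the policy of storing a dirty value only when the block reads it again are all hypothetical choices made for the example.

```python
import math

def next_use(stmts, var, start):
    """Index of the first statement at or after `start` that reads `var`."""
    for i in range(start, len(stmts)):
        if var in stmts[i][2:]:
            return i
    return math.inf

def allocate_min(stmts, num_regs=3):
    """Simulate farthest-next-use allocation on one basic block.

    stmts is a list of (dest, op, src1, src2) tuples. Returns the emitted
    instruction trace and the load/store counts. Assumes num_regs >= 3.
    """
    regs = [None] * num_regs      # variable currently held by each register
    dirty = [False] * num_regs    # True if the register differs from memory
    trace = []
    counts = {"load": 0, "store": 0}

    def pick(start, pinned):
        # Use a free register if there is one.
        if None in regs:
            return regs.index(None)
        # Otherwise evict the variable whose next use is farthest away.
        cands = [r for r in range(num_regs) if regs[r] not in pinned]
        victim = max(cands, key=lambda r: next_use(stmts, regs[r], start))
        # Assumption: a dirty value is stored only if the block reads it again.
        if dirty[victim] and next_use(stmts, regs[victim], start) != math.inf:
            trace.append(f"store R{victim + 1}, {regs[victim]}")
            counts["store"] += 1
        return victim

    for i, (dest, op, a, b) in enumerate(stmts):
        for src in (a, b):
            if src not in regs:
                r = pick(i, {a, b})   # never evict an operand of this statement
                regs[r], dirty[r] = src, False
                trace.append(f"load R{r + 1}, {src}")
                counts["load"] += 1
        ra, rb = regs.index(a), regs.index(b)
        # The sources have been read, so the result may reuse their registers.
        rd = regs.index(dest) if dest in regs else pick(i + 1, set())
        regs[rd], dirty[rd] = dest, True
        trace.append(f"{op} R{rd + 1}, R{ra + 1}, R{rb + 1}")
    return trace, counts

# The four statements from the slide, with 3 FP registers.
example = [("c", "add", "a", "b"), ("e", "add", "a", "d"),
           ("f", "sub", "a", "b"), ("h", "add", "d", "c")]
trace, counts = allocate_min(example, num_regs=3)
print("\n".join(trace))
```

Under these assumptions the first five emitted instructions match the slide: c is the value spilled from R3 when d has to be loaded, because its next use (statement 4) is the farthest.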
A Simple Compiler
Input: long basic blocks (SPIRAL after optimizations, ATLAS after optimizations)
Stages: Parsing -> Register Allocation (Belady’s algorithm) -> Target code generation
Output: MIPS assembly code
SPIRAL and FFT Code
Make use of formula transformations and intelligent search to generate optimized DSP libraries
Search for the best degree of unrolling
  Small sizes (2 to 64): straight-line code; FFT64 has 1400+ statements
  Large sizes (128+): use small-size results as components
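$$F_{rs} = (F_r \otimes I_s)\, T^{rs}_{s}\, (I_r \otimes F_s)\, L^{rs}_{r}$$
(F_r and F_s are the smaller transforms used as components.)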
Interesting Patterns in FFT Codes
… …
y32 = y2 - y18
y33 = y3 - y19
y34 = y2 + y18
y35 = y3 + y19
y36 = y10 - y26
y37 = y11 - y27
y38 = y10 + y26
y39 = y11 + y27
y40 = y34 - y38
y41 = y35 - y39
y42 = y34 + y38
y43 = y35 + y39
y44 = y32 - y37
y45 = y33 + y36
… …
One define
Two uses
Close uses
Simplify: One-define, one-use program
It can be proved that Belady’s MIN algorithm generates the minimum number of reloads and stores!
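The define/use pattern can be checked mechanically on the fragment shown on this slide. The following is a small sketch, assuming each statement has the form `dest = x op y`; the helper name `def_use_counts` is hypothetical and the counts cover only the visible fragment, not the full FFT code.

```python
import re
from collections import Counter

# The statements visible on the slide (the surrounding code is elided there).
FRAGMENT = """
y32 = y2 - y18
y33 = y3 - y19
y34 = y2 + y18
y35 = y3 + y19
y36 = y10 - y26
y37 = y11 - y27
y38 = y10 + y26
y39 = y11 + y27
y40 = y34 - y38
y41 = y35 - y39
y42 = y34 + y38
y43 = y35 + y39
y44 = y32 - y37
y45 = y33 + y36
"""

def def_use_counts(code):
    """Count how often each variable is defined and how often it is read."""
    defs, uses = Counter(), Counter()
    for line in code.strip().splitlines():
        dest, expr = line.split("=")
        defs[dest.strip()] += 1
        uses.update(re.findall(r"[A-Za-z_]\w*", expr))
    return defs, uses

defs, uses = def_use_counts(FRAGMENT)
# Variables defined in the fragment whose two uses both fall inside it.
two_uses = sorted(v for v in defs if uses[v] == 2)
print(max(defs.values()), two_uses)
```

Every variable in the fragment is defined exactly once, and y34, y35, y38 and y39 are each read twice within a few statements of their definition. The other temporaries have their second use outside the visible fragment.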
Performance Evaluation: FFT
Speedup: FFT 32: 12%, FFT 64: 33%
[Figure: Performance of the best formula for FFT 4-64. x-axis: FFT size (4, 8, 16, 32, 64); y-axis: MFlops (0 to 700); series: MIN, MIPSPro, G77; regions marked "no spill" and "spills"]
[Figure: Performance of all the formulas for FFT 64. x-axis: different formulas for FFT of size 64 (1 to 176); y-axis: execution time in microseconds (3 to 7); series: MIN, MIPSPro]
ATLAS and Matrix Multiplication
An empirical optimizer searching for best parameters (degree of unrolling, etc.)
Study: innermost loop body
  KU = 64
  NU:MU from 2:2 to 8:8
  300-4000 LOC
[Figure: tiles of matrices A, B and C, annotated with MU, NU, KU and Tile Size]
for (int j = 0; j < TileSize; j += NU)
  for (int i = 0; i < TileSize; i += MU)
    load C1..8 into registers
    for (int k = 0; k < TileSize; k += KU)
      (*) load A k, 1..4 into registers
          load B 1..2, k into registers
          C1 += A k, 1 * B 1, k
          C2 += A k, 1 * B 2, k
          … …
          C8 += A k, 4 * B 2, k
      Repeat (*) for k+1, …, k+KU-1
    store C back to memory
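To give a feel for how long the unrolled innermost body becomes, the sketch below generates only the multiply-add statements for given unrolling factors. It is an illustration, not ATLAS output: the function name `unrolled_body` and the `c_i_j`, `a_i_k`, `b_k_j` naming are hypothetical, and the loads and stores are left out.

```python
def unrolled_body(mu, nu, ku):
    """Return the multiply-add statements of one fully unrolled k-loop body."""
    body = []
    for k in range(ku):
        for j in range(nu):
            for i in range(mu):
                # One update of a C element held in a register.
                body.append(f"c_{i}_{j} += a_{i}_{k} * b_{k}_{j}")
    return body

small = unrolled_body(2, 2, 64)   # NU:MU = 2:2, KU = 64
large = unrolled_body(8, 8, 64)   # NU:MU = 8:8, KU = 64
print(len(small), len(large))     # 256 4096
```

With KU = 64 the multiply-add count grows from 256 to 4096 statements, the same order of magnitude as the 300-4000 LOC range quoted for the study.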
Performance Evaluation for MM
[Figure: MFlops (50 to 500) versus degree of unroll (2x2, 2x4, 4x2, 4x4, 4x5, 5x4, 4x6, 6x4, 5x5, 5x6, 6x5, 4x8, 8x4, 6x6, 6x8, 8x6, 7x7, 7x8, 8x7, 8x8); series: MIN, MIPSPro, GCC; regions marked "no spill" and "spills"]
Spills for NU:MU = 4:8
  MIPSPro: 438
  MIN: 892
Explanation
Keep spilling c elements: more stores
Long dependency chain

1 c0 += a0 * b0
2 c1 += a1 * b0
3 c2 += a2 * b0
4 c3 += a3 * b0
5 c64 += a0 * b64
6 c65 += a1 * b64
7 c66 += a2 * b64
8 c67 += a3 * b64
9 c0 += a64 * b1
10 c1 += a65 * b1
11 c2 += a66 * b1
12 c3 += a67 * b1
13 c64 += a64 * b65
14 c65 += a65 * b65
15 c66 += a66 * b65
16 c67 += a67 * b65
Original

load R0, a0
load R1, b0
load R2, c0
1 madd R2, R2, R0, R1
load R3, a1
load R4, c1
2 madd R4, R4, R3, R1
load R5, a2
store R4, c1
load R4, c2
3 madd R4, R4, R5, R1
store R4, c2
load R4, a3
store R2, c0
load R2, c3
4 madd R2, R2, R4, R1
… …
Solution
Spill a, b, c elements: fewer stores
Break dependency chain
Use the instruction scheduling from MIPSPro

c0 += a0 * b0
c0 += a64 * b1
c1 += a65 * b1
c1 += a1 * b0
c2 += a2 * b0
c3 += a3 * b0
c2 += a66 * b1
c3 += a67 * b1

c66 += a2 * b64
c64 += a0 * b64
c66 += a66 * b65
c64 += a64 * b65
c65 += a1 * b64
c67 += a3 * b64
c67 += a67 * b65
c65 += a65 * b65
Scheduled by MIPSPro
Performance Evaluation for MM
10% better when MU:NU is larger than 4:6
[Figure: MFlops (50 to 500) versus degree of unroll (configurations 1 to 20); series: MIN, MIPSPro, MINSched; regions marked "no spill" and "spills"]
Spills for NU:MU = 4:8
  MINSched: 356
  MIPSPro: 438
  MIN: 892
Conclusions
Belady’s MIN algorithm
  Generates the minimum number of reloads and stores for the one-define, one-use problem
  Performs better than state-of-the-art compilers like MIPSPro and GCC
  Speedup: 12% (FFT32), 33% (FFT64), 10% (MM)
Further benchmarks to be tested
Code Size after Register Allocation
[Figure: Code size in LOC (0 to 3000) before and after register allocation for FFT16, FFT32 and FFT64]