The Power of Belady’s Algorithm in Register Allocation for Long Basic Blocks
Jia Guo, María Jesús Garzarán and David Padua
{jiaguo, garzaran, padua}@uiuc.edu
University of Illinois at Urbana-Champaign

Post on 21-Dec-2015



LCPC2003 2

Motivation

Long basic blocks
  Compiler optimizations
  Library generators: SPIRAL, ATLAS

[Figure: Speedup obtained from unrolling FFTs. Speedup (y-axis, 0 to 3) for FFT16, FFT32 and FFT64 on SPARC and MIPS.]

The effectiveness of register allocation on long basic blocks


Contributions

Apply Belady’s MIN algorithm to long basic blocks

Compared with MIPSPro, Belady’s MIN algorithm performs
  10% faster on Matrix Multiplication code
  12% faster on FFT code of size 32
  33% faster on FFT code of size 64


Outline

Belady’s MIN algorithm
On FFT code
On Matrix Multiplication code
Conclusions


Belady’s MIN Algorithm

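Chooses the variable with the farthest next use as the spill victim
Guarantees the minimum number of reloads, but not the minimum number of stores

Example with 3 FP registers:

  1  c = a + b
  2  e = a + d
  3  f = a - b
  4  h = d + c

Generated code:

  load  R1, a
  load  R2, b
  add   R3, R1, R2
  store R3, c
  load  R3, d
  … …

Register state after statement 1:

  Reg       R1     R2     R3
  Var       a      b      c
  Next use  2      3      4
  Status    clean  clean  dirty

To load d for statement 2, the register with the farthest next use is spilled: R3, which holds c. Because c is dirty it is stored first. R3 then holds d (next use 4, clean).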
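The spill choice in this example can be simulated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' compiler: the function name `min_allocate`, the `(dest, src1, src2)` statement format and the store-counting rule are all hypothetical. It assumes at least two registers, that every variable is defined at most once, that all inputs start in memory, and that a dirty value is stored on eviction only if it is read again later in the block.

```python
def min_allocate(stmts, num_regs):
    """Count loads and spill stores for one basic block using the
    farthest-next-use (Belady MIN) eviction rule.

    stmts: list of (dest, src1, src2) variable names, in program order.
    Assumptions (not from the slides): num_regs >= 2, each variable is
    defined at most once, and values live out of the block are ignored.
    """
    INF = float("inf")

    # uses[v] = statement indices (in increasing order) where v is read
    uses = {}
    for i, (_, src1, src2) in enumerate(stmts):
        for src in (src1, src2):
            uses.setdefault(src, []).append(i)

    def next_use(var, after):
        """Index of the first read of var strictly after statement `after`."""
        return next((u for u in uses.get(var, []) if u > after), INF)

    regs = {}  # resident variable -> True if dirty (memory copy is stale)
    loads = stores = 0

    def evict(i, keep):
        """Free one register: drop the resident variable, outside `keep`,
        whose next use is farthest away."""
        nonlocal stores
        victim = max((v for v in regs if v not in keep),
                     key=lambda v: next_use(v, i))
        dirty = regs.pop(victim)
        if dirty and next_use(victim, i) < INF:
            stores += 1  # spill store: the value is needed again later

    for i, (dest, src1, src2) in enumerate(stmts):
        for src in (src1, src2):
            if src not in regs:
                if len(regs) == num_regs:
                    evict(i, keep={src1, src2})
                regs[src] = False  # freshly loaded, so clean
                loads += 1
        if dest not in regs and len(regs) == num_regs:
            evict(i, keep=set())  # sources are already read, so they may go
        regs[dest] = True  # newly computed value is dirty

    return loads, stores


# The four-statement example from the slide, with 3 FP registers
block = [("c", "a", "b"), ("e", "a", "d"), ("f", "a", "b"), ("h", "d", "c")]
print(min_allocate(block, 3))
```

Under these assumptions the sketch reports 5 loads and 1 store for the example. The single store is the spill of c before d is loaded, matching the `store R3, c` / `load R3, d` pair in the generated code.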

A Simple Compiler

[Diagram: long basic blocks (SPIRAL and ATLAS output after optimizations) -> Parsing -> Register Allocation (Belady’s algorithm) -> Target code generation -> MIPS assembly code]



SPIRAL and FFT Code

Makes use of formula transformations and intelligent search to generate optimized DSP libraries

Searches for the best degree of unrolling
  Small sizes (2 to 64): straight-line code; FFT64 has 1400+ statements
  Large sizes (128+) use the small-size results as components

$$F_{rs} = (F_r \otimes I_s)\, T^{rs}_s\, (I_r \otimes F_s)\, L^{rs}_r$$


Interesting Patterns in FFT Codes

  … …
  y32 = y2 - y18
  y33 = y3 - y19
  y34 = y2 + y18
  y35 = y3 + y19
  y36 = y10 - y26
  y37 = y11 - y27
  y38 = y10 + y26
  y39 = y11 + y27
  y40 = y34 - y38
  y41 = y35 - y39
  y42 = y34 + y38
  y43 = y35 + y39
  y44 = y32 - y37
  y45 = y33 + y36
  … …

One define
Two uses
Close uses

Simplify: One-define, one-use program

For a one-define, one-use program, it can be proved that Belady’s MIN algorithm generates the minimum number of reloads and stores!
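The define/use pattern can be checked mechanically on the excerpt above. The following Python sketch is only an illustration: `def_use_counts` is a hypothetical helper, and it sees just the fourteen statements shown on the slide, so a variable whose second use lies in the elided code is counted as having one use.

```python
import re
from collections import Counter

# The statements shown on the slide (the elided "… …" parts are not included)
snippet = """
y32 = y2 - y18
y33 = y3 - y19
y34 = y2 + y18
y35 = y3 + y19
y36 = y10 - y26
y37 = y11 - y27
y38 = y10 + y26
y39 = y11 + y27
y40 = y34 - y38
y41 = y35 - y39
y42 = y34 + y38
y43 = y35 + y39
y44 = y32 - y37
y45 = y33 + y36
"""


def def_use_counts(code):
    """Count how often each variable is defined (left of '=') and
    used (right of '=') in straight-line code of the form 'x = y op z'."""
    defs, uses = Counter(), Counter()
    for line in code.strip().splitlines():
        dest, expr = line.split("=")
        defs[dest.strip()] += 1
        uses.update(re.findall(r"y\d+", expr))
    return defs, uses


defs, uses = def_use_counts(snippet)
print("max definitions per variable:", max(defs.values()))
print("max uses per variable:", max(uses.values()))
print("uses of y34:", uses["y34"])
```

Within this excerpt every variable is defined once and none is used more than twice, which is the pattern the slide points out.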


Performance Evaluation: FFT

Speedup: 12% for FFT 32, 33% for FFT 64

[Figure: Performance of the best formula for FFT 4-64. MFlops (y-axis, 0 to 700) against FFT size (4, 8, 16, 32, 64) for MIN, MIPSPro and G77; the plot is annotated "no spill" and "spills".]

[Figure: Performance of all the formulas for FFT 64. Execution time in microseconds (y-axis, 3 to 7) against the different formulas for FFT of size 64 (x-axis, 1 to 176) for MIN and MIPSPro.]



ATLAS and Matrix Multiplication

An empirical optimizer that searches for the best parameters (degree of unrolling, etc.)

Study: innermost loop body
  KU = 64
  NU:MU from 2:2 to 8:8
  300-4000 LOC

[Diagram: tiles of A, B and C with side Tile Size; the block of A is labeled MU and KU, the block of B is labeled NU and KU.]

  for (int j = 0; j < TileSize; j += NU)
    for (int i = 0; i < TileSize; i += MU)
      load C1..8 into registers
      for (int k = 0; k < TileSize; k += KU)
        (*) load A k,1..4 into registers
            load B 1..2,k into registers
            C1 += A k,1 * B 1,k
            C2 += A k,1 * B 2,k
            … …
            C8 += A k,4 * B 2,k
        Repeat (*) for k+1 to k+KU-1
      store C back to memory
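The loop nest above can be written as a runnable Python sketch. This is not ATLAS code: `tiled_matmul` is a hypothetical name, the matrices are plain nested lists indexed as A[i][k], B[k][j] and C[i][j] (an assumption, since the slide uses its own index order), and the tile size is assumed to be a multiple of MU, NU and KU. The accumulator block `acc` stands in for the C elements kept in registers.

```python
def tiled_matmul(A, B, C, tile, MU, NU, KU):
    """Compute C += A * B for tile x tile matrices (nested lists),
    following the register-blocked loop structure on the slide.

    Assumption: tile is divisible by MU, NU and KU.
    """
    for j in range(0, tile, NU):
        for i in range(0, tile, MU):
            # "load C into registers": an MU x NU block of accumulators
            acc = [[C[i + ii][j + jj] for jj in range(NU)]
                   for ii in range(MU)]
            for k in range(0, tile, KU):
                # In the generated code this loop is fully unrolled KU times,
                # which is what produces the long basic block.
                for kk in range(k, k + KU):
                    a = [A[i + ii][kk] for ii in range(MU)]  # load A elements
                    b = [B[kk][j + jj] for jj in range(NU)]  # load B elements
                    for ii in range(MU):
                        for jj in range(NU):
                            acc[ii][jj] += a[ii] * b[jj]
            # "store C back to memory"
            for ii in range(MU):
                for jj in range(NU):
                    C[i + ii][j + jj] = acc[ii][jj]
    return C


# Small usage example: MU=4, NU=2 gives the 8 accumulators C1..C8 of the slide
n = 8
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i - j for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
tiled_matmul(A, B, C, n, MU=4, NU=2, KU=4)
```

With MU, NU and KU fixed, the body of the k loop expands to MU * NU * KU multiply-add statements, which is why larger unroll degrees produce the several-thousand-line blocks studied here.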


Performance Evaluation for MM

[Figure: MFlops (y-axis, 50 to 500) against degree of unroll (2x2, 2x4, 4x2, 4x4, 4x5, 5x4, 4x6, 6x4, 5x5, 5x6, 6x5, 4x8, 8x4, 6x6, 6x8, 8x6, 7x7, 7x8, 8x7, 8x8) for MIN, MIPSPro and GCC; the plot is annotated "no spill" and "spills".]

Spills for NU:MU = 4:8
  MIPSPro: 438
  MIN: 892


Explanation

MIN keeps spilling c elements, which causes more stores
Long dependency chain

Original:

   1  c0  += a0  * b0
   2  c1  += a1  * b0
   3  c2  += a2  * b0
   4  c3  += a3  * b0
   5  c64 += a0  * b64
   6  c65 += a1  * b64
   7  c66 += a2  * b64
   8  c67 += a3  * b64
   9  c0  += a64 * b1
  10  c1  += a65 * b1
  11  c2  += a66 * b1
  12  c3  += a67 * b1
  13  c64 += a64 * b65
  14  c65 += a65 * b65
  15  c66 += a66 * b65
  16  c67 += a67 * b65

Code generated by MIN (the numbers refer to the statements above):

     load  R0, a0
     load  R1, b0
     load  R2, c0
  1  madd  R2, R2, R0, R1
     load  R3, a1
     load  R4, c1
  2  madd  R4, R4, R3, R1
     load  R5, a2
     store R4, c1
     load  R4, c2
  3  madd  R4, R4, R5, R1
     store R4, c2
     load  R4, a3
     store R2, c0
     load  R2, c3
  4  madd  R2, R2, R4, R1
     … …


Solution

Spill a, b and c elements: fewer stores
Break the dependency chain
Use the instruction scheduling from MIPSPro

Scheduled by MIPSPro:

  c0  += a0  * b0
  c0  += a64 * b1
  c1  += a65 * b1
  c1  += a1  * b0
  c2  += a2  * b0
  c3  += a3  * b0
  c2  += a66 * b1
  c3  += a67 * b1

  c66 += a2  * b64
  c64 += a0  * b64
  c66 += a66 * b65
  c64 += a64 * b65
  c65 += a1  * b64
  c67 += a3  * b64
  c67 += a67 * b65
  c65 += a65 * b65
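One way to see why the rescheduled order needs fewer spills of c elements is to measure how many statements separate the two updates of each c element. The Python sketch below is an illustration only: `reuse_distances` is a hypothetical helper, and the two destination sequences are copied from the original and the MIPSPro-scheduled statement orders.

```python
def reuse_distances(dests):
    """Map each destination to the number of statements between its
    consecutive updates (1 means back-to-back)."""
    last_seen = {}
    distances = {}
    for pos, d in enumerate(dests):
        if d in last_seen:
            distances.setdefault(d, []).append(pos - last_seen[d])
        last_seen[d] = pos
    return distances


# Destinations of the 16 statements in the original order
original = ["c0", "c1", "c2", "c3", "c64", "c65", "c66", "c67"] * 2

# Destinations of the same statements as scheduled by MIPSPro
scheduled = ["c0", "c0", "c1", "c1", "c2", "c3", "c2", "c3",
             "c66", "c64", "c66", "c64", "c65", "c67", "c67", "c65"]

orig_dist = reuse_distances(original)
sched_dist = reuse_distances(scheduled)
print("original: ", orig_dist)
print("scheduled:", sched_dist)
```

In the original order each c element is updated again only 8 statements later, so with few registers it tends to be evicted while dirty and must be stored. In the scheduled order the second update follows within at most 3 statements, so a c element is much more likely to still be in its register.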


Performance Evaluation for MM

10% better when MU:NU is larger than 4:6

[Figure: MFlops (y-axis, 50 to 500) against degree of unroll (x-axis, 1 to 20) for MIN, MIPSPro and MINSched; the plot is annotated "no spill" and "spills".]

Spills for NU:MU = 4:8
  MINSched: 356
  MIPSPro: 438
  MIN: 892



Conclusions

Belady’s MIN algorithm
  Generates the minimum number of reloads and stores for the one-define, one-use problem
  Performs better than state-of-the-art compilers like MIPSPro and GCC
  Speedup: 12% (FFT32), 33% (FFT64), 10% (MM)

Further benchmarks to be tested


Code Size after Register Allocation

[Figure: Code size in LOC (y-axis, 0 to 3000) before and after register allocation for FFT16, FFT32 and FFT64.]


Speedup from Fully Unrolling

[Figure: Speedup (y-axis, 0 to 3.5) against the different FFT formulas (x-axis, 1 to 196) for FFT16, FFT32 and FFT64.]

Average speedup
  FFT16: 2.5769
  FFT32: 2.1478
  FFT64: 1.7347