CASES 2007

Florida State University

Chris Zimmer, Steve Hines, Prasad Kulkarni, Gary Tyson, David Whalley

Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures

Motivation

Embedded processors have fewer registers.

Compiler optimizations increase register pressure.

As a result, it is difficult to apply aggressive compiler optimizations on embedded systems.

Vector Multiply Example

Even before aggressive optimizations, 60% of available registers are already used.

Further optimizations such as loop unrolling and software pipelining are inhibited.

int A[1000], B[1000];
void vmul() {
    int I;
    for (I = 2; I < 1000; I++)
        B[I] = A[I] * B[I-2];
}

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    mul r0,r12,r1
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3

Application Configurable Processors

Exploit common reference patterns found in code.

Small register files mimic these reference behaviors.

A map table provides register redirection.

The architecture is changed to add more registers with minimal impact on the ISA; in particular, the operand size does not increase.

Architectural Modifications

[Figure: the register file, the customized register structures (Queue Q1, Queue Q2, Queue Q3, Stack Q4, Circular Buffer Q5), and the map table with entries R0 → R0, R1 → Q1, R6 → R6, R15 → R15.]

Software Pipelining

Software pipelining is not often found in embedded compilers.

Software pipelining reduces the overall cycle time of a loop.

It extracts iterations so that instructions from different iterations overlap.

It fills stall cycles with useful work.

It consumes registers!

Software Pipelining Example

Stalls are present when the loop runs:

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    stall
    stall
    stall
    mul r0,r12,r1
    stall
    stall
    stall
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3

int A[1000], B[1000];
void vmul() {
    int I;
    for (I = 2; I < 1000; I++)
        B[I] = A[I] * B[I-2];
}

.L3:
    ldr r12,[r4], #4
    mul r0,r12,r1
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3
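As a rough source-level illustration of what software pipelining does to this loop, the sketch below overlaps the loads of iteration I+1 with the multiply and store of iteration I. The function names, the explicit array arguments, and the two-stage split are assumptions made for illustration; a compiler performs this transformation on machine instructions, not on C source.

```c
#include <assert.h>
#include <stddef.h>

#define N 1000

/* The loop from the slide, written with explicit array arguments. */
void vmul_plain(const int *a, int *b) {
    assert(a != NULL && b != NULL);
    for (int i = 2; i < N; i++)
        b[i] = a[i] * b[i - 2];
}

/* Hypothetical two-stage pipelined form:
 * stage 1 = the two loads, stage 2 = multiply and store. */
void vmul_pipelined(const int *a, int *b) {
    assert(a != NULL && b != NULL);

    /* Prolog: issue the loads of the first iteration. */
    int x = a[2];
    int y = b[0];

    int i;
    for (i = 2; i < N - 1; i++) {
        /* Stage 1 of iteration i+1: these loads overlap the multiply below. */
        int next_x = a[i + 1];
        int next_y = b[i - 1];   /* written in iteration i-1, so already final */

        /* Stage 2 of iteration i. */
        b[i] = x * y;

        /* next_x and next_y are the extra live values, i.e. the register cost. */
        x = next_x;
        y = next_y;
    }

    /* Epilog: finish the last iteration, whose loads were already issued. */
    b[i] = x * y;
}
```

Both functions produce the same array contents. The pipelined form keeps two extra values live across the loop body, which is the added register pressure the slides refer to.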

Instruction

Goal: Minimal modification to the existing instruction set.

Single-cycle instruction latency.

Method: Add a single instruction to the ISA that maps and unmaps a common register specifier into a customized register structure.

qmap <Reg specifier>, <Custom reg map information>, <Custom reg specifier>

qmap r3,#4,q3
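To make the effect of qmap concrete, here is a minimal C sketch of a map table being updated by a qmap-like operation. The names (`map_entry`, `map_table`, `qmap`, `qmap_clear`), the 16-entry table size, and the treatment of the middle operand as an opaque `info` value are all assumptions for illustration; the slides do not give the encoding, the meaning of the map information, or how unmapping is expressed.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 16   /* assumed: ARM-style specifiers r0..r15 */

/* One entry per architected register specifier (hypothetical layout). */
typedef struct {
    bool mapped;      /* false: accesses go to the architected register file */
    int  structure;   /* index of the customized structure, e.g. 3 for q3 */
    int  info;        /* opaque "custom reg map information" operand */
} map_entry;

typedef struct {
    map_entry entry[NUM_REGS];
} map_table;

/* Start with every specifier referring to the architected register file. */
void map_table_init(map_table *t) {
    for (int r = 0; r < NUM_REGS; r++) {
        t->entry[r].mapped = false;
        t->entry[r].structure = 0;
        t->entry[r].info = 0;
    }
}

/* Model of "qmap rN,#info,qM": redirect specifier rN to structure qM. */
void qmap(map_table *t, int reg, int info, int structure) {
    assert(reg >= 0 && reg < NUM_REGS);
    t->entry[reg].mapped = true;
    t->entry[reg].structure = structure;
    t->entry[reg].info = info;
}

/* Model of unmapping: the specifier refers to the register file again. */
void qmap_clear(map_table *t, int reg) {
    assert(reg >= 0 && reg < NUM_REGS);
    t->entry[reg].mapped = false;
}
```

Under these assumptions, the slide's `qmap r3,#4,q3` corresponds to `qmap(&t, 3, 4, 3)`, after which accesses to r3 would be redirected to q3 while every other specifier is unaffected.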

Architectural Modifications

[Figure: the register file, the customized register structures (Queue Q1, Queue Q2, Queue Q3, Destructive Queue Q4, Circular Buffer Q5), and the map table with entries R0, R1 → Q1, R6 → R6, R15 → R15.]

An access to R0, which has no mapping in the table, gets its data from the register file.

R1 is mapped into Q1 and retrieves its data from there.
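The redirection described on this slide can be sketched in C as a lookup performed on every register access. The names (`cpu_state`, `read_reg`, `write_reg`), the queue capacity, and the choice that a read removes the oldest queued value are assumptions for illustration, not details taken from the slides.

```c
#include <assert.h>

#define NUM_REGS    16
#define NUM_STRUCTS 6     /* indices 1..5 stand for Q1..Q5; index 0 is unused */
#define QUEUE_CAP   4     /* assumed capacity */
#define UNMAPPED    (-1)

typedef struct {
    int data[QUEUE_CAP];
    int head;             /* position of the oldest value */
    int count;            /* number of values currently held */
} reg_queue;

typedef struct {
    int       regfile[NUM_REGS];
    int       map[NUM_REGS];        /* UNMAPPED, or an index into queues[] */
    reg_queue queues[NUM_STRUCTS];
} cpu_state;

void cpu_init(cpu_state *c) {
    for (int r = 0; r < NUM_REGS; r++) {
        c->regfile[r] = 0;
        c->map[r] = UNMAPPED;
    }
    for (int q = 0; q < NUM_STRUCTS; q++) {
        c->queues[q].head = 0;
        c->queues[q].count = 0;
    }
}

/* A write goes to the register file unless the specifier is mapped. */
void write_reg(cpu_state *c, int r, int value) {
    assert(r >= 0 && r < NUM_REGS);
    int q = c->map[r];
    if (q == UNMAPPED) {
        c->regfile[r] = value;
        return;
    }
    reg_queue *rq = &c->queues[q];
    assert(rq->count < QUEUE_CAP);
    rq->data[(rq->head + rq->count) % QUEUE_CAP] = value;   /* enqueue at tail */
    rq->count++;
}

/* A read consults the map table first, then fetches from the chosen source. */
int read_reg(cpu_state *c, int r) {
    assert(r >= 0 && r < NUM_REGS);
    int q = c->map[r];
    if (q == UNMAPPED)
        return c->regfile[r];
    reg_queue *rq = &c->queues[q];
    assert(rq->count > 0);
    int value = rq->data[rq->head];                          /* oldest value */
    rq->head = (rq->head + 1) % QUEUE_CAP;
    rq->count--;
    return value;
}
```

With `c.map[1] = 1`, two writes to R1 followed by two reads return the values in the order they were written, while R0 continues to behave as an ordinary register.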

Software Pipelining Example

[Figure: queues Q1, Q2, and Q3 holding values for the pipelined loop.]

int A[1000], B[1000];
void vmul() {
    int I;
    for (I = 2; I < 1000; I++)
        B[I] = A[I] * B[I-2];
}

Register Usage

Register Savings Using Register Structures (Loads 16x4)

The slide shows three versions of the last column; the values are listed in the order they appear.

Benchmark                    AR in original loop   AR needed to pipeline   AR contained in customized structures
N Real Updates               10                    10                      6 / 6 / 9
Dot Product                  9                     9                       4 / 4 / 8
Matrix Multiply              9                     9                       4 / 4 / 8
Fir                          6                     6                       4 / 4 / 12
Mac                          10                    8                       10 / 12 / 18
Fir2Dim (3 similar loops)    10                    10                      4 / 4 / 8



Results – Multiplies varying latency, load latency set at four

[Figure: percent cycle reduction (y-axis, 0 to 50) versus multiply latency (x-axis: 2, 4, 8, 16, 32) with in-order issue. Benchmarks: Dot Product, Matrix, Fir, N Real Updates, Conv 45, Mac, Fir2Dim.]

Results – Loads varying latency, multiply latency set at four

[Figure: percent cycle reduction (y-axis, -10 to 60) versus load latency (x-axis: 2, 4, 8, 16, 32) with in-order issue. Benchmarks: Dot Product, Matrix, Fir, N Real Updates, Conv 45, Mac, Fir2Dim.]

Conclusions

Customized register structures reduce register pressure.

Software pipelining is viable in resource-constrained environments.

Performance can be improved with minor impact to the ISA.

Extras

Reference Behaviors

ldr r12,[r6,r4, lsl #8]
ldr r8,[r6,r4, lsl #12]
str r8,[r3,r4, lsl #16]
str r12,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]

Stack Reference Behavior

Application Configurable Architecture

Application configurable processors are designed around a mapping table similar to the register rename table found in many out-of-order implementations.

The map table is read during every access to the architected register file.

The lookup determines whether a register specifier refers to the original architected register file or to a customized register structure.

Application Configurable Architecture

The customized register files are small, but they efficiently manage values that would otherwise require many architected registers.

The customized register files can mimic queues, stacks, and circular buffers.

These structures are accessed using the same register specifiers that are used to access the architected register file.
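The three access behaviors can be contrasted with a small C sketch in which one read and one write operation behave differently depending on the kind of structure. The names (`reg_struct`, `rs_write`, `rs_read`), the capacity, and the exact semantics chosen here (queue reads remove the oldest value, stack reads remove the newest, circular-buffer reads cycle through the stored values without removing them) are assumptions for illustration.

```c
#include <assert.h>

#define CAP 4   /* assumed capacity of each small structure */

typedef enum { KIND_QUEUE, KIND_STACK, KIND_CIRCULAR } rs_kind;

typedef struct {
    rs_kind kind;
    int data[CAP];
    int count;   /* number of stored values */
    int head;    /* queue: oldest value; circular buffer: next value to read */
} reg_struct;

void rs_init(reg_struct *s, rs_kind kind) {
    s->kind = kind;
    s->count = 0;
    s->head = 0;
}

/* A write through the mapped register specifier stores one more value. */
void rs_write(reg_struct *s, int value) {
    assert(s->count < CAP);
    if (s->kind == KIND_QUEUE)
        s->data[(s->head + s->count) % CAP] = value;   /* append at the tail */
    else
        s->data[s->count] = value;                     /* next free slot */
    s->count++;
}

/* A read through the mapped register specifier returns the next value. */
int rs_read(reg_struct *s) {
    assert(s->count > 0);
    int value;
    switch (s->kind) {
    case KIND_QUEUE:                        /* first in, first out */
        value = s->data[s->head];
        s->head = (s->head + 1) % CAP;
        s->count--;
        return value;
    case KIND_STACK:                        /* last in, first out */
        s->count--;
        return s->data[s->count];
    default:                                /* circular: wrap around, keep values */
        value = s->data[s->head];
        s->head = (s->head + 1) % s->count;
        return value;
    }
}
```

In each case the program names a single register specifier, yet the structure behind it holds several values at once, which is how a few specifiers can stand in for many architected registers.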

Reference Behaviors

ldr r12,[r6,r4, lsl #8]
ldr r8,[r6,r4, lsl #12]
str r8,[r3,r4, lsl #16]
str r12,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]

Stack Reference Behavior

[Figure labels: R8, R12, R1, r1]

ldr r1,[r6,r4, lsl #12]
str r1,[r3,r4, lsl #16]
str r1,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]

This frees up r8 and r12 for other uses.

Qmap Instruction

[Figure labels: R8, R12, R1, q0]

This frees up r8 and r12 for other uses.

Modulo Scheduling

For our work we used modulo scheduling. This uses the dependences and latencies of the loop instructions to generate a modulo-scheduled loop.

The prolog and epilog are then built from this schedule.

The prolog and epilog require register renaming of loop-carried dependences to keep the loop correct. Renaming in embedded processors is often not possible.
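As a rough illustration of how dependences and latencies feed into a modulo schedule, the sketch below computes the textbook lower bound on the initiation interval (the number of cycles between the starts of successive iterations) as the larger of a resource bound and a recurrence bound. The function names and the example numbers are assumptions for illustration; this is not the scheduler used in this work.

```c
#include <assert.h>

/* Ceiling of a / b for non-negative a and positive b. */
int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Resource bound: instructions per iteration over instructions issued per cycle. */
int res_mii(int num_ops, int issue_width) {
    return ceil_div(num_ops, issue_width);
}

/* Recurrence bound for one dependence cycle: the latencies summed around the
 * cycle, divided by the number of iterations the dependence spans. */
int rec_mii(int cycle_latency, int distance) {
    return ceil_div(cycle_latency, distance);
}

/* Lower bound on the initiation interval, given a single dependence cycle. */
int min_ii(int num_ops, int issue_width, int cycle_latency, int distance) {
    int by_resources = res_mii(num_ops, issue_width);
    int by_recurrence = rec_mii(cycle_latency, distance);
    return by_resources > by_recurrence ? by_resources : by_recurrence;
}
```

For example, a seven-instruction loop on a single-issue machine with a dependence cycle of 9 cycles spanning 2 iterations gives `min_ii(7, 1, 9, 2) == 7`, so the issue width is the limit. If the cycle's latency grows to 32 cycles, the bound becomes 16 and the recurrence dominates.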

Register Renaming due to software pipelining

Renaming doesn’t work: there are not enough registers.

Rotating registers would require a significant rewrite of the embedded ISA.

Instead, the loop-carried values can simply be mapped into a register queue that holds each value across several iterations.
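A minimal C sketch of this idea, using the vector multiply loop from the earlier slides: the value produced in one iteration is needed two iterations later, so a two-entry queue can carry it instead of a pair of renamed registers. The function names and the software queue are assumptions for illustration; in the proposed hardware the queue would sit behind a single mapped register specifier.

```c
#include <assert.h>
#include <stddef.h>

#define N     1000
#define DEPTH 2      /* dependence distance of B[I-2] */

/* Reference loop: reloads B[I-2] from memory in every iteration. */
void vmul_reference(const int *a, int *b) {
    assert(a != NULL && b != NULL);
    for (int i = 2; i < N; i++)
        b[i] = a[i] * b[i - 2];
}

/* Same computation, with the loop-carried value held in a small queue. */
void vmul_with_queue(const int *a, int *b) {
    assert(a != NULL && b != NULL);

    int queue[DEPTH] = { b[0], b[1] };   /* the two values live on loop entry */
    int head = 0;                        /* position of the oldest value */

    for (int i = 2; i < N; i++) {
        int carried = queue[head];       /* read: value from two iterations ago */
        int result = a[i] * carried;
        queue[head] = result;            /* write: the freed slot becomes the tail */
        head = (head + 1) % DEPTH;
        b[i] = result;
    }
}
```

Each iteration performs one read and one write through the same name, and the queue depth matches the number of iterations the value must survive, so no per-iteration renaming is needed.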

Results: Register Savings

As instruction latency grows, more iterations of the loop are extracted to hide that latency.

The extra registers that renaming would require were measured at 25% to 200% of the registers available on the ARM.
