CASES 2007
Florida State University
Chris Zimmer, Steve Hines, Prasad Kulkarni
Gary Tyson, David Whalley
Facilitating Compiler Optimizations Through the
Dynamic Mapping of Alternate Register Structures
Motivation
Embedded processors have fewer registers.
Compiler optimizations increase register pressure.
As a result, it is difficult to apply aggressive compiler optimizations on embedded systems.
Vector Multiply Example
Even before aggressive optimizations, 60% of available registers are already used
Further optimizations like Loop Unrolling and Software Pipelining are inhibited
int A[1000], B[1000];
void vmul() {
    int I;
    for (I=2; I < 1000; I++)
        B[I] = A[I] * B[I-2];
}
.L3:
ldr r1,[r2,r3, lsl #2]
ldr r12,[r4], #4
mul r0,r12,r1
str r0,[r5,r3, lsl #2]
add r3,r3,#1
cmp r3, #1000
blt .L3
Application Configurable Processors
Exploit common reference patterns found in code
Small register files mimic these reference behaviors.
Map Table provides register redirection.
Changed the architecture to add more registers with minimal impact on ISA support; in particular, operand size does not increase.
Architectural Modifications
[Figure: register file with customized register structures (Queue Q1, Queue Q2, Queue Q3, Stack Q4, Circular Buffer Q5) and a map table with entries R0 -> R0, R1 -> Q1, R6 -> R6, R15 -> R15]
Software Pipelining
Software pipelining is not often found in embedded compilers.
Software pipelining reduces the overall cycle time of a loop.
Extracts work from later iterations so it overlaps with the current one
Fills stall cycles with useful instructions
Consumes registers!!
Software Pipelining Example
Stalls present when the loop is run
.L3:
ldr r1,[r2,r3, lsl #2]
ldr r12,[r4], #4
stall
stall
stall
mul r0,r12,r1
stall
stall
stall
str r0,[r5,r3, lsl #2]
add r3,r3,#1
cmp r3, #1000
blt .L3
int A[1000], B[1000], C[1000];
void vmul() {
    int I;
    for (I=2; I < 1000; I++)
        B[I] = A[I] * C[I];
}
.L3:
ldr r1,[r2,r3, lsl #2]
ldr r12,[r4], #4
mul r0,r12,r1
str r0,[r5,r3, lsl #2]
add r3,r3,#1
cmp r3, #1000
blt .L3
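As a rough source-level picture of what software pipelining does to this loop, the sketch below overlaps the loads of iteration I+1 with the multiply and store of iteration I. It is hand-written C, not compiler output. The function names `vmul_plain` and `vmul_pipelined`, the pointer parameters, and the two-stage split are assumptions made for illustration, and B is assumed not to alias A or C.

```c
#include <assert.h>

#define N 1000

/* Reference loop from the slide: B[I] = A[I] * C[I] for I in [2, N). */
void vmul_plain(const int *A, int *B, const int *C) {
    for (int i = 2; i < N; i++)
        B[i] = A[i] * C[i];
}

/* Hand-pipelined version with two stages (loads, then multiply/store). */
void vmul_pipelined(const int *A, int *B, const int *C) {
    /* Precondition for this sketch: the output does not alias the inputs. */
    assert(B != A && B != C);

    /* Prolog: issue the loads for the first iteration. */
    int a = A[2];
    int c = C[2];

    /* Kernel: load operands for iteration i+1 while finishing iteration i. */
    for (int i = 2; i < N - 1; i++) {
        int a_next = A[i + 1];
        int c_next = C[i + 1];
        B[i] = a * c;
        a = a_next;
        c = c_next;
    }

    /* Epilog: finish the last iteration, whose loads were already issued. */
    B[N - 1] = a * c;
}
```

The kernel keeps four values live (`a`, `c`, `a_next`, `c_next`) where the plain loop needs two, which is the extra register pressure the slides refer to.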
Instruction
Goal: Minimal modification to the existing instruction set, with single-cycle instruction latency.
Method: Add a single instruction to the ISA that maps and unmaps a common register specifier into a customized register structure.
qmap <Reg Specifier> <Custom reg map information> <Custom reg specifier>
qmap r3,#4,q3
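One way to picture the effect of `qmap` is as an update to a small per-register map table. The C model below is only a sketch under assumptions: the slides do not define the encoding of the "custom reg map information" operand, so it is stored uninterpreted, and `MapTable`, `UNMAPPED`, `qunmap`, and `map_lookup` are hypothetical names, not part of the proposed ISA.

```c
#include <assert.h>

#define NUM_REGS 16     /* r0..r15, matching the register names in the slides */
#define UNMAPPED (-1)   /* hypothetical marker: the specifier is not redirected */

/* One map-table entry per architected register specifier. */
typedef struct {
    int structure;  /* custom structure index (3 for q3), or UNMAPPED */
    int info;       /* "custom reg map information" operand, kept uninterpreted */
} MapEntry;

typedef struct {
    MapEntry entry[NUM_REGS];
} MapTable;

/* Start with every specifier referring to the architected register file. */
void map_table_reset(MapTable *t) {
    for (int r = 0; r < NUM_REGS; r++) {
        t->entry[r].structure = UNMAPPED;
        t->entry[r].info = 0;
    }
}

/* Model of "qmap rN,#info,qM": redirect specifier rN to structure qM. */
void qmap(MapTable *t, int reg, int info, int structure) {
    assert(reg >= 0 && reg < NUM_REGS);
    assert(structure >= 0);
    t->entry[reg].structure = structure;
    t->entry[reg].info = info;
}

/* Hypothetical unmap helper; the slides say qmap also unmaps but do not
   show how that form is written. */
void qunmap(MapTable *t, int reg) {
    assert(reg >= 0 && reg < NUM_REGS);
    t->entry[reg].structure = UNMAPPED;
    t->entry[reg].info = 0;
}

/* Read path: which structure serves this specifier?
   Returns UNMAPPED when the access goes to the architected register file. */
int map_lookup(const MapTable *t, int reg) {
    assert(reg >= 0 && reg < NUM_REGS);
    return t->entry[reg].structure;
}
```

In this model, `qmap r3,#4,q3` corresponds to `qmap(&table, 3, 4, 3)`, after which `map_lookup(&table, 3)` reports structure 3 while every other specifier still resolves to the register file.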
Architectural Modifications
[Figure: register file with customized register structures (Queue Q1, Queue Q2, Queue Q3, Destructive Queue Q4, Circular Buffer Q5) and a map table with entries R0, R1 -> Q1, R6 -> R6, R15 -> R15]
An access to R0, which has no mapping in the table, would get its data from the register file.
R1 is mapped into Q1 and would retrieve its data from there.
Software Pipelining Example
[Figure: contents of queues Q1, Q2, and Q3 for the loop below]
int A[1000], B[1000], C[1000];
void vmul() {
    int I;
    for (I=2; I < 1000; I++)
        B[I] = A[I] * C[I];
}
Register Usage
Benchmark                   AR in Original Loop   AR Needed to Pipeline   AR Contained in Customized Structures
N Real Updates              10                    10                      6
Dot Product                 9                     9                       4
Matrix Multiply             9                     9                       4
Fir                         6                     6                       4
Mac                         10                    8                       10
Fir2Dim (3 Similar Loops)   10                    10                      4

N Real Updates              10                    10                      6
Dot Product                 9                     9                       4
Matrix Multiply             9                     9                       4
Fir                         6                     6                       4
Mac                         10                    8                       12
Fir2Dim                     10                    10                      4

N Real Updates              10                    10                      9
Dot Product                 9                     9                       8
Matrix Multiply             9                     9                       8
Fir                         6                     6                       12
Mac                         10                    8                       18
Fir2Dim                     10                    10                      8
Loads 16x4 Register Savings Using Register Structures
Loads 32x4 Register Savings Using Register Structures
Loads 8x4 Register Savings Using Register Structures
Results – Varying multiply latency, load latency set at four
[Figure: In-Order Issue. Percent Cycle Reduction (y-axis, 0 to 50) versus Multiply Latency (x-axis: 2, 4, 8, 16, 32) for Dot Product, Matrix, Fir, N Real Updates, Conv 45, Mac, and Fir2Dim]
Results – Varying load latency, multiply latency set at four
[Figure: In-Order Issue. Percent Cycle Reduction (y-axis, -10 to 60) versus Load Latency (x-axis: 2, 4, 8, 16, 32) for Dot Product, Matrix, Fir, N Real Updates, Conv 45, Mac, and Fir2Dim]
Conclusions
Customized register structures reduce register pressure.
Software pipelining is viable in resource-constrained environments.
Performance can be improved with minor impact to the ISA.
Reference Behaviors
ldr r1,[r6,r4, lsl #4]
ldr r12,[r6,r4, lsl #8]
ldr r8,[r6,r4, lsl #12]
str r8,[r3,r4, lsl #16]
str r12,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]
Stack Reference Behavior
Application Configurable Architecture
Application configurable processors are designed around a mapping table similar to the register rename table found in many out-of-order implementations.
The map table is read during every access to the architected register file.
This lookup determines whether a register specifier refers to the original architected register file or to a customized register structure.
Application Configurable Architecture
The customized register files are small, but they efficiently manage values that would otherwise require many architected registers.
The customized register files can mimic queues, stacks, and circular buffers.
These structures are accessed using the same register specifier that is used to access the architected register file.
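The three behaviors can be modeled in software as one small buffer with a policy flag, so that a register read or write through a mapped specifier becomes a single `rs_read` or `rs_write` call. This is a sketch, not the hardware design: the capacity `CAP`, the names `RegStructure`, `rs_read`, and `rs_write`, and the exact semantics (destructive reads for the queue and stack, non-destructive cycling reads for the circular buffer) are assumptions.

```c
#include <assert.h>

#define CAP 8   /* hypothetical capacity of one customized register file */

typedef enum { KIND_QUEUE, KIND_STACK, KIND_CIRCULAR } Kind;

typedef struct {
    Kind kind;
    int data[CAP];
    int head;    /* read cursor (queue and circular buffer) */
    int count;   /* number of valid values */
} RegStructure;

void rs_init(RegStructure *s, Kind kind) {
    s->kind = kind;
    s->head = 0;
    s->count = 0;
}

/* A write through the mapped specifier appends a value. */
void rs_write(RegStructure *s, int value) {
    assert(s->count < CAP);
    /* The queue is a ring, so its tail wraps; the others fill from slot 0. */
    int slot = (s->kind == KIND_QUEUE) ? (s->head + s->count) % CAP : s->count;
    s->data[slot] = value;
    s->count++;
}

/* A read through the mapped specifier follows the structure's policy. */
int rs_read(RegStructure *s) {
    assert(s->count > 0);
    int value;
    switch (s->kind) {
    case KIND_QUEUE:      /* oldest value first, removed on read */
        value = s->data[s->head];
        s->head = (s->head + 1) % CAP;
        s->count--;
        break;
    case KIND_STACK:      /* newest value first, removed on read */
        s->count--;
        value = s->data[s->count];
        break;
    default:              /* KIND_CIRCULAR: cycle through the stored values */
        value = s->data[s->head];
        s->head = (s->head + 1) % s->count;
        break;
    }
    return value;
}
```

Writing 1, 2, 3 and then reading gives 1, 2, 3 from the queue, 3, 2, 1 from the stack, and 1, 2, 3, 1, ... from the circular buffer, all through the same pair of calls.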
Reference Behaviors
ldr r1,[r6,r4, lsl #4]
ldr r12,[r6,r4, lsl #8]
ldr r8,[r6,r4, lsl #12]
str r8,[r3,r4, lsl #16]
str r12,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]
Stack Reference Behavior
[Figure: stack with entries R8, R12, R1, accessed through r1]
ldr r1,[r6,r4, lsl #4]
ldr r1,[r6,r4, lsl #8]
ldr r1,[r6,r4, lsl #12]
str r1,[r3,r4, lsl #16]
str r1,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]
Frees up r8 and r12 for other uses.
Modulo Scheduling
For our work we used modulo scheduling, which uses the dependences and latencies of the loop instructions to generate a modulo-scheduled loop.
The prolog and epilog are then built from this schedule.
The prolog and epilog require register renaming of loop-carried dependences to keep the loop correct. Renaming in embedded processors is often not possible.
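The role that dependences and latencies play can be seen in the standard lower bound on the initiation interval (II), the number of cycles between the starts of successive iterations: it is the larger of a resource bound and a recurrence bound. The sketch below computes that bound for a simplified machine with a single issue width. The `Recurrence` struct and function names are hypothetical, and the slides do not say how the authors' scheduler computes its II.

```c
#include <assert.h>

/* One dependence cycle (recurrence) through the loop body. */
typedef struct {
    int latency;   /* total latency of the instructions on the cycle */
    int distance;  /* iterations spanned by the cycle, at least 1 */
} Recurrence;

/* Integer ceiling of a / b for a >= 0, b > 0. */
static int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Resource bound: a machine issuing issue_width instructions per cycle
   needs at least this many cycles per iteration. */
int res_mii(int num_ops, int issue_width) {
    return ceil_div(num_ops, issue_width);
}

/* Recurrence bound: the slowest dependence cycle limits how closely
   successive iterations can be started. Never less than 1. */
int rec_mii(const Recurrence *rec, int count) {
    int bound = 1;
    for (int i = 0; i < count; i++) {
        int b = ceil_div(rec[i].latency, rec[i].distance);
        if (b > bound)
            bound = b;
    }
    return bound;
}

/* Minimum initiation interval: the larger of the two bounds. */
int min_ii(int num_ops, int issue_width, const Recurrence *rec, int count) {
    int r1 = res_mii(num_ops, issue_width);
    int r2 = rec_mii(rec, count);
    return r1 > r2 ? r1 : r2;
}
```

As a purely hypothetical input, a seven-instruction loop on a single-issue machine with one recurrence of total latency 9 spanning 2 iterations gives a resource bound of 7, a recurrence bound of 5, and therefore a minimum II of 7.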
Register Renaming due to software pipelining
Renaming doesn’t work… not enough registers.
Rotating registers would require a significant rewrite of the embedded ISA.
The loop-carried values can simply be mapped into a register queue that holds each value across several iterations.
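To illustrate the idea with the vector multiply loop B[I] = A[I] * B[I-2], the sketch below carries the two most recent B values in a two-entry queue instead of in separately renamed temporaries. It is a C model of the behavior, not the hardware or the compiler's output. `RegQueue`, `rq_read`, `rq_write`, and the function names are hypothetical, and the queue depth is chosen to match the dependence distance of 2.

```c
#include <assert.h>

#define N 1000
#define DEPTH 2   /* dependence distance of B[I-2] */

/* Small FIFO standing in for a register queue. */
typedef struct {
    int slot[DEPTH];
    int head;    /* index of the oldest value */
    int count;   /* number of values currently held */
} RegQueue;

static void rq_write(RegQueue *q, int v) {
    assert(q->count < DEPTH);
    q->slot[(q->head + q->count) % DEPTH] = v;
    q->count++;
}

static int rq_read(RegQueue *q) {
    assert(q->count > 0);
    int v = q->slot[q->head];
    q->head = (q->head + 1) % DEPTH;
    q->count--;
    return v;
}

/* Reference loop: the loop-carried value is re-read from memory. */
void vmul_reference(const int *A, int *B) {
    for (int i = 2; i < N; i++)
        B[i] = A[i] * B[i - 2];
}

/* Same loop with the loop-carried values held in the queue. */
void vmul_queue(const int *A, int *B) {
    RegQueue q = { {0}, 0, 0 };
    rq_write(&q, B[0]);
    rq_write(&q, B[1]);
    for (int i = 2; i < N; i++) {
        int prev2 = rq_read(&q);   /* value of B[i-2], oldest in the queue */
        int prod = A[i] * prev2;
        B[i] = prod;
        rq_write(&q, prod);        /* becomes B[i-2] two iterations later */
    }
}
```

Each iteration performs one read and one write on the same queue, so a single register specifier mapped to it would cover both live copies of the loop-carried value.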