Pipeline Enhancements for the Y86 Architecture
Kelly Carothers
Enhancements Done
Hardware:
*BTFNT Branch Jumping
*Load-forwarding for variables
Software:
*Use of IADDL
*Rearrangement of code
*Loop Unrolling
Load-forwarding
Passing a value backwards from a later pipeline stage to an earlier one before it has been written to a register or memory.
CPE Avg: 17.15
Load-forwarding from Memory stage to Execute Stage
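As a rough illustration of when this forwarding path removes a stall, the sketch below models the load/use check in Python. It assumes the standard textbook PIPE design, where a load followed by a dependent instruction costs one bubble, and it assumes the new path only helps when the loaded value is the data being stored (as in the mrmovl/rmmovl pair of the copy loop), not when it is needed as an address or ALU operand. All function and parameter names are hypothetical and are not taken from the actual HCL.

```python
# Hypothetical model of the load/use hazard check; names are illustrative only.
LOADS = {"mrmovl", "popl"}
STORES = {"rmmovl", "pushl"}  # their srcA value is not needed until the Memory stage


def needs_stall(load_icode, load_dst, next_icode, next_src_a, next_src_b,
                load_forwarding):
    """Return True if the instruction right after a load must stall one cycle."""
    if load_icode not in LOADS:
        return False
    if load_dst == next_src_b:
        # The loaded value feeds the ALU (e.g. an address), so it is needed too early.
        return True
    if load_dst == next_src_a:
        # Assumption: with load forwarding, a store can pick up its data
        # from the Memory stage instead of waiting in Decode.
        return not (load_forwarding and next_icode in STORES)
    return False
```

For the pair `mrmovl (%ebx), %eax` / `rmmovl %eax, (%ecx)`, this model reports a stall without load forwarding and none with it. The measured drop from 18.15 to 17.15 CPE would be consistent with one bubble removed per copied element.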
IADDL
A single instruction that replaces the IRMOVL/ADDL pair needed to add an immediate value to a register.
CPE Avg: 14.22
IADDL implementation
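The effect of the instruction can be sketched in Python as follows. This is only a behavioural model, assuming 32-bit registers and the usual ZF/SF/OF condition codes; the helper names and the register dictionary are hypothetical and do not reflect the HCL changes themselves.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Wrap an integer to a signed 32-bit value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def iaddl(regs, imm, rb):
    """Model of 'iaddl $imm, rb': rb += imm, returning the condition codes."""
    a = to_signed32(imm)
    b = to_signed32(regs[rb])
    result = to_signed32(a + b)
    regs[rb] = result
    overflow = (a < 0) == (b < 0) and (result < 0) != (a < 0)
    return {"ZF": result == 0, "SF": result < 0, "OF": overflow}


def irmovl_addl(regs, imm, scratch, rb):
    """Model of the two-instruction sequence that iaddl replaces."""
    regs[scratch] = to_signed32(imm)                     # irmovl $imm, scratch
    regs[rb] = to_signed32(regs[scratch] + regs[rb])     # addl scratch, rb
```

Both paths leave the same value in the destination register, but the two-instruction form costs an extra instruction and overwrites a scratch register (%edi in the original listing).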
IADDL Code Comparison: Original vs. Modified
Modified (with IADDL):
        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
Loop:   mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
Npos:   iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:
Original:
        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
Loop:   mrmovl (%ebx), %eax     # src...
        rmmovl %eax, (%ecx)     # ...and store it to dst
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        irmovl $1, %edi
        addl %edi, %esi         # count++
Npos:   irmovl $1, %edi
        subl %edi, %edx         # len--
        irmovl $4, %edi
        addl %edi, %ebx         # src++
        addl %edi, %ecx         # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:
BTFNT Branch Jumping
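BTFNT (Backwards Taken, Forwards Not Taken): a backward branch (target at a lower address than the branch) is predicted taken, and a forward branch is predicted not taken. In effect, always predict the smaller address.
CPE Avg: 12.37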
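The prediction rule can be sketched in a few lines of Python. The comparison against an always-taken predictor assumes the baseline pipeline predicts every conditional branch as taken, as the standard textbook design does; the function names and the trace format are hypothetical.

```python
def predict_taken_btfnt(branch_addr, target_addr):
    """BTFNT: predict taken only when the target lies before the branch."""
    return target_addr < branch_addr


def predict_taken_always(branch_addr, target_addr):
    """Assumed baseline: every conditional branch is predicted taken."""
    return True


def mispredictions(predictor, trace):
    """Count wrong predictions over (branch_addr, target_addr, was_taken) records."""
    return sum(1 for addr, target, taken in trace
               if predictor(addr, target) != taken)
```

A loop-closing branch jumps backwards and is usually taken, while a forward skip such as `jle Npos` depends on the data. BTFNT gets the loop branch right on every iteration except the last and stops paying a penalty on forward branches that usually fall through.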
Code Rearrangement
*Code was arranged specifically for BTFNT
*Many unnecessary checks removed
Avg CPE: 11.64
Code Rearrangement: IADDL Mod vs. End Result
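End result:
        rrmovl %edx, %esi       # count = len
        iaddl $1, %edx          # len++ (undone by the first decrement at Loop)
Loop:   iaddl $-1, %edx         # len--
        jle Done
Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax         # val <= 0?
        jle decEsi              # if so, goto decEsi:
        jmp Loop
decEsi: iaddl $-1, %esi         # count--
        jg Loop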
IADDL mod:
        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
Loop:   mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
Npos:   iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:
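The end-result listing also changes how the count is maintained: it starts at len and is decremented for every non-positive value, instead of starting at zero and being incremented for every positive one. A small Python sketch of the two strategies, with hypothetical function names and lists standing in for the src and dst memory regions:

```python
def ncopy_count_up(src):
    """IADDL-mod strategy: count starts at 0 and goes up on positive values."""
    dst = []
    count = 0
    for val in src:
        dst.append(val)
        if val > 0:
            count += 1
    return dst, count


def ncopy_count_down(src):
    """End-result strategy: count starts at len and goes down on non-positive values."""
    dst = []
    count = len(src)
    for val in src:
        dst.append(val)
        if val <= 0:
            count -= 1
    return dst, count
```

Both return the same copy and the same count. In the assembly, counting down lets the decrement at decEsi be followed directly by a backward `jg Loop`, which suits BTFNT. That branch falls through only when the count reaches zero, which can only happen once every element has been processed.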
Loop Unrolling
*Increases code size
*Decreases CPE
Loop Unrolling: No unrolling vs. 1 unroll
1 unroll:
Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        iaddl $-1, %edx
        jle Done
        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        iaddl $4, %ebx
        iaddl $4, %ecx
        andl %eax, %eax
        jle decEsi
        jmp Loop
No unrolling:
Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        jmp Loop
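The shape of the unrolled loop can be sketched in Python as below. This is a simplified model with a hypothetical function name: it keeps the length check between the two copies of the body, as the listing does, and counts upward for readability rather than reproducing the decEsi path.

```python
def ncopy_unrolled_once(src):
    """Copy src and count positive values, with the loop body duplicated once."""
    dst = []
    count = 0
    i = 0
    n = len(src)
    while i < n:
        # First copy of the body.
        dst.append(src[i])
        if src[i] > 0:
            count += 1
        i += 1
        # Length check kept between the two copies (the 'jle Done' in the listing).
        if i >= n:
            break
        # Second copy of the body; only now does control jump back to the top.
        dst.append(src[i])
        if src[i] > 0:
            count += 1
        i += 1
    return dst, count
```

The result is the same as the plain loop for any length, odd or even. The gain comes from executing the jump back to the top once per two elements instead of once per element, at the cost of a longer loop body.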
Loop Unrolling Results
No Unrolling, Base Avg. CPE: 11.64
1 Unroll, Avg CPE: 11.16
2 Unrolls, Avg CPE: 11.00
Total Results
Initial Avg CPE: 18.15
Final Avg CPE: 11.00
Total Decrease of 7.15 CPE.
Final Results
Enhancement         Avg CPE   CPE Decrease
None                18.15     -------
Load-Forwarding     17.15     1.00
IADDL               14.22     2.93
BTFNT               12.37     1.85
Code Rearranging    11.64     0.73
1 Loop Unrolled     11.16     0.48
2 Loops Unrolled    11.00     0.16
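The decrease column can be re-derived from the average CPE values with a short Python check. The figures are the ones reported in the table; the variable and function names are hypothetical, and Decimal is used only to avoid floating-point rounding noise in the subtraction.

```python
from decimal import Decimal

# Average CPE after each enhancement, in the order the enhancements were applied.
STEPS = [
    ("None", "18.15"),
    ("Load-Forwarding", "17.15"),
    ("IADDL", "14.22"),
    ("BTFNT", "12.37"),
    ("Code Rearranging", "11.64"),
    ("1 Loop Unrolled", "11.16"),
    ("2 Loops Unrolled", "11.00"),
]


def cpe_decreases(steps):
    """Return the CPE saved by each step relative to the step before it."""
    values = [Decimal(cpe) for _, cpe in steps]
    return [prev - cur for prev, cur in zip(values, values[1:])]


def total_decrease(steps):
    """Return the CPE saved between the first and last step."""
    return Decimal(steps[0][1]) - Decimal(steps[-1][1])
```

The per-step decreases sum to the total decrease, so the table and the Total Results figures agree.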