Pipeline Enhancements for the Y86 Architecture
Kelly Carothers
Enhancements done
Hardware:
*BTFNT Branch Jumping
*Load-forwarding for variables
Software:
*Use of IADDL
*Rearrangement of code
*Loop Unrolling
Load-forwarding
A value is passed backwards from a later pipeline stage to an earlier one before it has been written to a register or to memory.
CPE Avg: 17.15
Load-forwarding from Memory stage to Execute Stage
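The effect of load-forwarding on stalls can be sketched as a small hazard check. This is a minimal model, assuming the usual textbook load/use rule (an instruction that reads a register loaded by the immediately preceding `mrmovl` or `popl` must stall one cycle) and assuming that forwarding from the Memory stage only helps when the loaded value is needed as store data, as in the `mrmovl`/`rmmovl` pair used by the copy loop. The names `Instr` and `needs_stall` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    op: str                      # mnemonic, e.g. "mrmovl", "rmmovl", "addl"
    dst_m: Optional[str] = None  # register written from memory (loads only)
    src_a: Optional[str] = None  # register read as operand A (store data for rmmovl)
    src_b: Optional[str] = None  # register read as operand B (address base for rmmovl)


def needs_stall(load: Instr, user: Instr, load_forwarding: bool) -> bool:
    """Return True if `user`, issued right after `load`, must stall one cycle."""
    if load.op not in ("mrmovl", "popl") or load.dst_m is None:
        return False  # not a load, so there is no load/use hazard
    uses_a = user.src_a == load.dst_m
    uses_b = user.src_b == load.dst_m
    if not (uses_a or uses_b):
        return False  # the next instruction does not read the loaded register
    if load_forwarding and user.op in ("rmmovl", "pushl") and not uses_b:
        # Assumption: the value is only needed as store data, so it can be
        # forwarded from the Memory stage instead of stalling.
        return False
    return True
```

For `mrmovl (%ebx), %eax` followed by `rmmovl %eax, (%ecx)`, this model reports a stall without load-forwarding and none with it. An `addl` that reads `%eax` right after the load still stalls in both cases.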
IADDL
Single instruction replaces the IRMOVL and ADDL instructions for an immediate add.
CPE Avg: 14.22
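The equivalence between the two-instruction sequence and IADDL can be illustrated with a toy register model. This is only a sketch: the helper names are hypothetical, registers are a plain dictionary, and only the zero and sign flags are modelled (overflow is left out).

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Wrap an integer to a signed 32-bit value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def set_cc(result: int) -> dict:
    """Simplified condition codes: zero flag and sign flag only."""
    return {"ZF": result == 0, "SF": result < 0}


def irmovl(regs: dict, imm: int, dst: str) -> None:
    regs[dst] = to_signed(imm)          # load an immediate, flags untouched


def addl(regs: dict, src: str, dst: str) -> dict:
    regs[dst] = to_signed(regs[dst] + regs[src])
    return set_cc(regs[dst])


def iaddl(regs: dict, imm: int, dst: str) -> dict:
    regs[dst] = to_signed(regs[dst] + imm)  # immediate add in one instruction
    return set_cc(regs[dst])
```

Running `irmovl $1, %edi` then `addl %edi, %esi` on one register set and `iaddl $1, %esi` on another leaves `%esi` and the flags identical. The IADDL version also leaves `%edi` untouched, so no scratch register is needed.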
IADDL implementation
IADDL Code Comparison: Original vs. Modified
Modified (with IADDL):

        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
    Loop:
        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
    Npos:
        iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:

Original:

        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
    Loop:
        mrmovl (%ebx), %eax     # src...
        rmmovl %eax, (%ecx)     # ...and store it to dst
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        irmovl $1, %edi
        addl %edi, %esi         # count++
    Npos:
        irmovl $1, %edi
        subl %edi, %edx         # len--
        irmovl $4, %edi
        addl %edi, %ebx         # src++
        addl %edi, %ecx         # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:
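As a reference for what both versions compute, the loop can be written as a short high-level function. This is a sketch inferred from the assembly comments (`count++`, `len--`, `src++`, `dst++`, `val <= 0?`); the function name and the list-based interface are assumptions.

```python
def ncopy(src: list) -> tuple:
    """Copy every word of src to dst and count the strictly positive ones."""
    dst = []
    count = 0
    for val in src:          # one iteration per element, like the Loop label
        dst.append(val)      # mrmovl from src, rmmovl to dst
        if val > 0:          # the assembly skips the increment when val <= 0
            count += 1
    return dst, count
```

Counting from the listings above, an iteration over a positive element executes 13 instructions in the original loop body and 10 in the IADDL version.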
BTFNT Branch Jumping
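BTFNT (Backward Taken, Forward Not Taken): a conditional jump is predicted taken when its target lies at a lower address than the next sequential instruction, and not taken otherwise. In effect, fetch always continues at the smaller of the two addresses.
CPE Avg: 12.37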
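The prediction rule fits in a few lines. The sketch below uses hypothetical names (`btfnt_predict`, `target`, `fall_through`) and covers conditional jumps only; it is not the control logic from the slides.

```python
def btfnt_predict(target: int, fall_through: int) -> tuple:
    """Predict the next PC for a conditional jump under BTFNT.

    target       -- address encoded in the jump instruction
    fall_through -- address of the instruction following the jump
    Returns (predicted_pc, predicted_taken).
    """
    taken = target < fall_through          # backward jump: predict taken
    predicted_pc = target if taken else fall_through
    return predicted_pc, taken
```

A loop-closing jump such as `jg Loop` points backwards and is predicted taken, which is correct on every iteration except the last. A forward jump that skips a few instructions is predicted not taken.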
Code Rearrangement
*Code was arranged specifically for BTFNT
*Many unnecessary checks removed
Avg CPE: 11.71
Code Rearrangement: IADDL Mod vs. End Result
End Result:

        rrmovl %edx, %esi
        iaddl $1, %edx
    Loop:
        iaddl $-1, %edx
        jle Done
    Loop1:
        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
    Npos:
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        jmp Loop
    decEsi:
        iaddl $-1, %esi
        jg Loop

IADDL Mod:

        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
    Loop:
        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
    Npos:
        iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:
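The End Result listing counts differently from the IADDL version: `count` starts at `len` and is decremented once for every non-positive element. The control flow can be traced with a direct transliteration. This is a sketch under two assumptions: the `Done` label directly follows `decEsi` (the slide does not show it), and `len` is non-negative. The function and variable names are hypothetical.

```python
def ncopy_rearranged(src: list) -> tuple:
    """Transliteration of the rearranged loop; returns (dst, count)."""
    dst = []
    count = len(src)            # rrmovl %edx, %esi : count = len
    remaining = len(src) + 1    # iaddl $1, %edx    : offset for the pre-decrement
    i = 0
    while True:
        remaining -= 1          # Loop: iaddl $-1, %edx
        if remaining <= 0:      # jle Done
            break
        val = src[i]            # Loop1: mrmovl (%ebx), %eax
        dst.append(val)         #        rmmovl %eax, (%ecx)
        i += 1                  # iaddl $4 on both pointers
        if val <= 0:            # andl %eax, %eax ; jle decEsi
            count -= 1          # decEsi: iaddl $-1, %esi
            if count <= 0:      # jg Loop not taken: fall through (assumed Done)
                break
        # positive element: jmp Loop
    return dst, count
```

The fall-through at `decEsi` is safe because `count` can only reach zero once every element has been processed and found non-positive, so nothing is left to copy at that point.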
Loop Unrolling
*Increases code size
*Decreases CPE
Loop Unrolling: No unrolling vs. 1 unroll
1 unroll:

    Loop1:
        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
    Npos:
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        iaddl $-1, %edx
        jle Done
        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        iaddl $4, %ebx
        iaddl $4, %ecx
        andl %eax, %eax
        jle decEsi
        jmp Loop

No unrolling:

    Loop1:
        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
    Npos:
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        jmp Loop
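A rough instruction count shows where the saving comes from. The sketch below covers only the path where every element is positive, and assumes the `Loop:` header (`iaddl $-1, %edx` / `jle Done`) from the rearranged code precedes `Loop1`. It counts executed instructions, not cycles, so it does not predict CPE. The constants and the function name are hypothetical.

```python
CHECK = 2  # iaddl $-1, %edx ; jle Done   (length check before each copy)
BODY = 6   # mrmovl, rmmovl, iaddl, iaddl, andl, jle decEsi
JMP = 1    # jmp Loop, executed once per pass through the unrolled body


def instructions_per_element(copies: int) -> float:
    """Instructions executed per positive element with `copies` loop bodies."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    per_pass = copies * (CHECK + BODY) + JMP
    return per_pass / copies
```

With one body (no unrolling) the count is 9 instructions per element, and with two bodies (1 unroll) it is 8.5. Each extra copy only spreads the single `jmp Loop` over more elements, so the gain shrinks with every additional unroll.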
Loop Unrolling Results
No Unrolling, Base Avg CPE: 11.64
1 Unroll, Avg CPE: 11.16
2 Unrolls, Avg CPE: 11.00
Total Results
Initial Avg CPE: 18.15
Final Avg CPE: 11.00
Total Decrease of 7.15 CPE.
Final Results
| Enhancement | Avg CPE | CPE Decrease |
|---|---|---|
| None | 18.15 | - |
| Load-Forwarding | 17.15 | 1.00 |
| IADDL | 14.22 | 2.93 |
| BTFNT | 12.37 | 1.85 |
| Code Rearranging | 11.64 | 0.73 |
| 1 Loop Unrolled | 11.16 | 0.48 |
| 2 Loops Unrolled | 11.00 | 0.16 |
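The per-step decreases and the total can be re-derived from the Avg CPE column. The short check below uses `Decimal` to avoid floating-point rounding. The CPE values are taken from the table; the variable and function names are illustrative only.

```python
from decimal import Decimal

# (enhancement, average CPE after applying it), in the order of the table
RESULTS = [
    ("None", "18.15"),
    ("Load-Forwarding", "17.15"),
    ("IADDL", "14.22"),
    ("BTFNT", "12.37"),
    ("Code Rearranging", "11.64"),
    ("1 Loop Unrolled", "11.16"),
    ("2 Loops Unrolled", "11.00"),
]


def cpe_decreases(rows):
    """Return the CPE decrease contributed by each step after the baseline."""
    values = [Decimal(cpe) for _, cpe in rows]
    return [prev - cur for prev, cur in zip(values, values[1:])]


def total_decrease(rows):
    """Return baseline CPE minus final CPE."""
    return Decimal(rows[0][1]) - Decimal(rows[-1][1])
```

The step decreases sum to the total of 7.15 CPE, which is roughly a 39% reduction from the 18.15 baseline.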