
Pipeline Enhancements for the Y86 Architecture

Kelly Carothers

Enhancements Done

Hardware:
*BTFNT Branch Jumping
*Load-forwarding for variables

Software:
*Use of IADDL
*Rearrangement of code
*Loop Unrolling

Load-forwarding

Passes a value from a later pipeline stage back to an earlier stage before it has been written to a register or memory.

CPE Avg: 17.15

Load-forwarding from Memory stage to Execute Stage
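
To make the effect concrete, the sketch below counts load/use stalls in the first instructions of the copy loop. It is a simplified model, not the actual pipeline control logic: it assumes that an instruction which reads a register loaded by the immediately preceding mrmovl costs one bubble, and that load-forwarding removes the bubble only when the following instruction is a store that uses the loaded register purely as the data to be written. The Instr type and the function count_load_use_stalls are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    reads: Tuple[str, ...]            # registers needed in Execute (operands, addresses)
    store_data: Optional[str] = None  # register that is only written out to memory
    load_dest: Optional[str] = None   # register filled by a memory load


def count_load_use_stalls(program, load_forwarding):
    """Count one-cycle bubbles caused by using a value right after loading it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.load_dest is None:
            continue  # previous instruction was not a load
        if prev.load_dest in cur.reads:
            stalls += 1  # value is needed in Execute: still a stall (assumption)
        elif prev.load_dest == cur.store_data and not load_forwarding:
            stalls += 1  # store data could have been forwarded, but is not
    return stalls


# First three instructions of the loop body from the slides.
loop_body = [
    Instr("mrmovl", reads=("%ebx",), load_dest="%eax"),
    Instr("rmmovl", reads=("%ecx",), store_data="%eax"),
    Instr("andl", reads=("%eax",)),
]

without_fwd = count_load_use_stalls(loop_body, load_forwarding=False)
with_fwd = count_load_use_stalls(loop_body, load_forwarding=True)
```

Under these assumptions the mrmovl/rmmovl pair costs one bubble per element without load-forwarding and none with it, which is consistent with the drop from 18.15 to 17.15 reported in this deck.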

IADDL

A single instruction replaces the IRMOVL and ADDL pair used for an immediate add.

CPE Avg: 14.22
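
The replacement can be sketched with a tiny register model. This is an illustration only: the helper names (to_s32, irmovl, addl, iaddl) are hypothetical Python functions, registers are a plain dict, and only the zero and sign flags are modelled (overflow is left out).

```python
def to_s32(x):
    """Wrap an integer to a signed 32-bit value."""
    return ((x + 2**31) % 2**32) - 2**31


def flags(result):
    # Simplified condition codes: zero flag and sign flag only.
    return {"ZF": result == 0, "SF": result < 0}


def irmovl(regs, value, rb):
    regs[rb] = to_s32(value)


def addl(regs, ra, rb):
    regs[rb] = to_s32(regs[ra] + regs[rb])
    return flags(regs[rb])


def iaddl(regs, value, rb):
    # One instruction: add the immediate directly to the register.
    regs[rb] = to_s32(regs[rb] + value)
    return flags(regs[rb])


# Original sequence: two instructions and a scratch register (%edi).
old = {"%esi": 5, "%edi": 0}
irmovl(old, 1, "%edi")
cc_old = addl(old, "%edi", "%esi")

# With IADDL: one instruction, %edi is left untouched.
new = {"%esi": 5, "%edi": 0}
cc_new = iaddl(new, 1, "%esi")
```

Both sequences leave the same value in %esi and the same flags; the IADDL version uses one instruction instead of two and does not need the scratch register.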

IADDL implementation

IADDL Code Comparison: Original vs. Modified

Modified (uses IADDL):

        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:

        # Loop body.
Loop:   mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
Npos:   iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:

Original:

        # Loop body.
Loop:   mrmovl (%ebx), %eax     # src...
        rmmovl %eax, (%ecx)     # ...and store it to dst
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        irmovl $1, %edi
        addl %edi, %esi         # count++
Npos:   irmovl $1, %edi
        subl %edi, %edx         # len--
        irmovl $4, %edi
        addl %edi, %ebx         # src++
        addl %edi, %ecx         # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:

BTFNT Branch Jumping

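BTFNT (Backwards Taken, Forwards Not Taken): a branch is predicted taken when its target lies behind it and not taken when its target lies ahead, so the predicted next address is always the smaller of the two candidates.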

CPE Avg: 12.37
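
The rule can be sketched in a few lines. This is a minimal model rather than the actual hardware change: predict_pc and count_mispredictions are hypothetical helper names, and the addresses in the example trace are made up.

```python
def predict_pc(jump_target, fall_through):
    """BTFNT: the predicted next PC is the smaller of the two addresses."""
    return min(jump_target, fall_through)


def count_mispredictions(trace):
    """trace holds (jump_target, fall_through, actually_taken) per branch."""
    misses = 0
    for jump_target, fall_through, actually_taken in trace:
        predicted_taken = jump_target < fall_through  # backward => taken
        if predicted_taken != actually_taken:
            misses += 1
    return misses


# Hypothetical loop-closing branch: target 0x20 is behind the branch,
# fall-through is 0x45. It is taken three times, then the loop exits.
loop_trace = [
    (0x20, 0x45, True),
    (0x20, 0x45, True),
    (0x20, 0x45, True),
    (0x20, 0x45, False),
]

backward_choice = predict_pc(0x20, 0x45)  # backward branch: go to the target
forward_choice = predict_pc(0x36, 0x30)   # forward branch: fall through
loop_misses = count_mispredictions(loop_trace)
```

In this trace the only misprediction is the final loop exit, which is why code laid out with backward loop branches and rarely taken forward branches suits BTFNT.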

Code Rearrangement

*Code was arranged specifically for BTFNT

*Many unnecessary checks removed

Avg CPE: 11.71

Code Rearrangement: IADDL Mod vs. End Result

End Result:

        rrmovl %edx, %esi
        iaddl $1, %edx

Loop:   iaddl $-1, %edx
        jle Done
Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        jmp Loop

decEsi: iaddl $-1, %esi
        jg Loop

IADDL Mod:

        # Loop body.
Loop:   mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
Npos:   iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:
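
Reading the End Result listing, count appears to start at len (rrmovl %edx, %esi) and to be decremented for every non-positive element, instead of starting at 0 and being incremented for every positive one. The sketch below only illustrates why the two approaches agree; the function names are hypothetical and the copy to dst is left out.

```python
def count_up(values):
    """IADDL Mod style: start at 0, add one per positive element."""
    count = 0
    for v in values:
        if v > 0:
            count += 1
    return count


def count_down(values):
    """End Result style: start at len, subtract one per non-positive element."""
    count = len(values)
    for v in values:
        if v <= 0:
            count -= 1
    return count


samples = [[], [1, 2, 3], [-1, 0, -5], [4, -2, 0, 7, 7, -9]]
results = [(count_up(s), count_down(s)) for s in samples]
```

Every element is either positive or non-positive, so subtracting the non-positive ones from len leaves exactly the number of positive ones.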

Loop Unrolling

*Increases code size

*Decreases CPE

Loop Unrolling: No unrolling vs. 1 unroll

1 unroll:

Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        iaddl $-1, %edx
        jle Done

        mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        iaddl $4, %ebx
        iaddl $4, %ecx
        andl %eax, %eax
        jle decEsi
        jmp Loop

No unrolling:

Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        jmp Loop

Loop Unrolling Results

No Unrolling, Base Avg CPE: 11.64

1 Unroll, Avg CPE: 11.16

2 Unrolls, Avg CPE: 11.00
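
For readers unfamiliar with the technique, here is a generic sketch of unrolling a copy-and-count loop once (two elements per pass). It shows the general idea only and is not a translation of the Y86 listings in this deck; ncopy and ncopy_unrolled2 are hypothetical names.

```python
def ncopy(src):
    """Copy src and count its positive elements, one element per pass."""
    dst = []
    count = 0
    for v in src:
        dst.append(v)
        if v > 0:
            count += 1
    return dst, count


def ncopy_unrolled2(src):
    """Same result, but the loop body handles two elements per pass."""
    n = len(src)
    dst = [0] * n
    count = 0
    i = 0
    while i + 1 < n:
        dst[i] = src[i]
        if src[i] > 0:
            count += 1
        dst[i + 1] = src[i + 1]
        if src[i + 1] > 0:
            count += 1
        i += 2  # loop overhead is paid once for two elements
    if i < n:   # leftover element when the length is odd
        dst[i] = src[i]
        if src[i] > 0:
            count += 1
    return dst, count


data = [3, -1, 0, 8, 5]
plain = ncopy(data)
unrolled = ncopy_unrolled2(data)
```

The body is roughly twice as long, but the loop overhead is executed about half as often, which matches the trade-off listed on the Loop Unrolling slide: larger code, lower CPE.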

Total Results

Initial Avg CPE: 18.15

Final Avg CPE: 11.00

Total Decrease of 7.15 CPE.

Final Results

Enhancement         Avg CPE    CPE Decrease
None                18.15      -------
Load-Forwarding     17.15      1.00
IADDL               14.22      2.93
BTFNT               12.37      1.85
Code Rearranging    11.64      0.73
1 Loop Unrolled     11.16      0.48
2 Loops Unrolled    11.00      0.16
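
The CPE Decrease column is the difference between each row's average and the row above it. A short sketch that recomputes the column and the total from the reported averages (the variable and function names are hypothetical; Decimal is used so the two-digit figures are handled exactly):

```python
from decimal import Decimal

# Average CPE after each enhancement, as reported in the table.
steps = [
    ("None", Decimal("18.15")),
    ("Load-Forwarding", Decimal("17.15")),
    ("IADDL", Decimal("14.22")),
    ("BTFNT", Decimal("12.37")),
    ("Code Rearranging", Decimal("11.64")),
    ("1 Loop Unrolled", Decimal("11.16")),
    ("2 Loops Unrolled", Decimal("11.00")),
]


def cpe_decreases(rows):
    """Return (name, decrease) for each row relative to the previous row."""
    return [
        (name, prev_cpe - cpe)
        for (_, prev_cpe), (name, cpe) in zip(rows, rows[1:])
    ]


decreases = cpe_decreases(steps)
total_decrease = steps[0][1] - steps[-1][1]

for name, drop in decreases:
    print(f"{name:<18}{drop}")
print(f"{'Total':<18}{total_decrease}")
```

The individual decreases add up to the total decrease of 7.15 CPE.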
