
CA226 - Advanced Computer Architecture

Stephen Blott <[email protected]>


Load RAW Stalls

    ld   r1,0(r2)
    dadd r4,r3,r1       ; unavoidable stall (on r1)

The value needed by the second instruction in Ex is available only after the first instruction has completed Mem.


Branch RAW Stalls

    dsub r1,r2,r3
    beqz r1,target

We can’t forward the necessary value until after Ex:

• hence, a stall of one cycle (whether the branch is taken or not)


A "Double Whammy" Stall

    ld   r1,0(r2)
    beqz r1,target

Stall of two cycles. This is a combination of the two previous stalls.
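The three RAW stalls above follow one simple rule, sketched here in Python. This is a rough model, not the simulator's logic: it assumes a five-stage pipeline with full forwarding, and it assumes that a branch needs its operand at the start of ID. The function name and stage numbering are illustrative only.

```python
# Stage numbers in a classic five-stage pipeline: IF=1, ID=2, EX=3, MEM=4, WB=5.
ID, EX, MEM = 2, 3, 4

def raw_stall_cycles(produced_after: int, needed_in: int, distance: int = 1) -> int:
    """Stall cycles for a consumer `distance` instructions behind its producer.

    produced_after: stage at whose end the producer's value can be forwarded.
    needed_in:      stage at whose start the consumer needs the value.
    """
    return max(0, produced_after - needed_in - distance + 1)

# ld r1,0(r2) ; dadd r4,r3,r1    -> value ready after MEM, needed in EX
load_use = raw_stall_cycles(MEM, EX)
# dsub r1,r2,r3 ; beqz r1,target -> value ready after EX, needed in ID (assumed)
alu_branch = raw_stall_cycles(EX, ID)
# ld r1,0(r2) ; beqz r1,target   -> both effects combine
load_branch = raw_stall_cycles(MEM, ID)

print(load_use, alu_branch, load_branch)  # 1 1 2
```

Under the same assumptions, two back-to-back ALU instructions give `raw_stall_cycles(EX, EX) == 0`, which matches the usual no-stall forwarding case.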


The Pipeline

The pipeline:

• is essentially a miniature graph of parallel-processing elements

• instructions flow from node to node



The Pipeline


Consider this …

    dadd r3,r1,r2
    dadd r4,r1,r2

They flow:

• smoothly through the pipeline, no stalls


Now consider this …

    dadd r3,r1,r2
    dmul r4,r1,r2

They flow:

• more slowly (more cycles), but still no stalls (multiplication is expensive)


And this …

    dmul r3,r1,r2
    dmul r4,r1,r2
    dmul r5,r1,r2

They flow:

• again, more slowly (more cycles), but still no stalls

• all three instructions flow through the multiplier


And this …

    dmul r4,r1,r2
    dadd r3,r1,r2

The dadd is not blocked by the dmul:

• the dadd overtakes the dmul in the pipeline

• still no stalls


Write-After-Write (WAW) Stalls

    dmul r3,r1,r2
    dadd r3,r1,r2

The dadd is now blocked by the dmul:

• were the dadd to overtake the dmul, r3 would have the incorrect final value

This is known as a:

• write-after-write (WAW) stall


Write-After-Write (WAW) Stalls

    dmul r3,r1,r2
    dadd r3,r1,r2
    daddi r5,r0,100

Note:

• subsequent, independent instructions are also blocked!
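To see why the dadd must not overtake the dmul, here is a small Python sketch that applies writes to r3 in completion order. The latencies (7 cycles for the multiplier, 1 for the adder) and the register values are assumptions chosen for illustration; they are not taken from the simulator.

```python
def final_register_value(writes):
    """Apply register writes in completion order and return the last value written.

    `writes` is a list of (issue_cycle, latency, value) for one destination
    register; an instruction's write lands at issue_cycle + latency.
    """
    completion_order = sorted(writes, key=lambda w: w[0] + w[1])
    return completion_order[-1][2]

r1, r2 = 6, 7
MUL_LATENCY, ADD_LATENCY = 7, 1   # assumed latencies, for illustration only

# dmul r3,r1,r2 issues at cycle 0 and dadd r3,r1,r2 at cycle 1, with no WAW stall:
# the dadd finishes first and the dmul then overwrites r3.
unstalled = final_register_value([(0, MUL_LATENCY, r1 * r2), (1, ADD_LATENCY, r1 + r2)])

# With the WAW stall, the dadd is held back so that its write lands after the dmul's.
stalled = final_register_value([(0, MUL_LATENCY, r1 * r2), (7, ADD_LATENCY, r1 + r2)])

print(unstalled, stalled)  # 42 13
```

Program order says r3 should end up holding the sum (13); without the stall it would hold the product (42).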



Another topic…


Example

Consider:

    for (i=0; i<1000; i+=1)
        a[i] += 1;      // where a[i] is an integer


Example

            .data               ; psched3.s
    N:      .word 8000          ; N = 1000 iterations
    a:      .space 8000         ; 1000 64-bit words

            .text
            daddi r1,r0,0       ; r1 = i = 0
            ld    r2,N(r0)      ; r2 = N

    loop:   ld    r3,a(r1)      ; r3 = a[i/8]
            daddi r3,r3,1       ; r3 = r3 + 1
            sd    r3,a(r1)      ; a[i/8] = r3
            daddi r1,r1,8       ; i = i + 8
            bne   r1,r2,loop    ; repeat, unless i == N
            nop
            halt


So …

Stalls per iteration:

• one load RAW stall on r3

• one branch RAW stall on r1

• plus 999 wasted nop cycles (in delay slot)

8007 cycles in total
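The total can be reproduced with a rough accounting model, sketched below in Python. It assumes one cycle per executed instruction, plus stall cycles, plus four cycles to fill a five-stage pipeline; the function and parameter names are illustrative, and the instruction counts are read off the listing.

```python
PIPELINE_FILL = 4  # assumed: cycles before the first instruction completes (5 stages)

def total_cycles(setup, per_iteration, iterations, stalls_per_iteration, teardown=1):
    """Rough cycle count: one cycle per instruction, plus stalls, plus pipeline fill."""
    instructions = setup + per_iteration * iterations + teardown
    stalls = stalls_per_iteration * iterations
    return instructions + stalls + PIPELINE_FILL

# Unscheduled integer loop: 2 setup instructions, 6 per iteration (including the
# delay-slot nop), 2 stall cycles per iteration, and a final halt.
unscheduled = total_cycles(setup=2, per_iteration=6, iterations=1000,
                           stalls_per_iteration=2)
print(unscheduled)  # 8007
```

The model makes the cost of each stall explicit: removing both stalls and the nop would save three cycles per iteration.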


After pipeline scheduling, …

            .data               ; psched4.s
    N:      .word 8000
    b:      .word 0             ; Address of a, minus 8 (hack)
    a:      .space 8000

            .text
            daddi r1,r0,0
            ld    r2,N(r0)

    loop:   ld    r3,a(r1)
            daddi r1,r1,8       ; moved up
            daddi r3,r3,1
            bne   r1,r2,loop
            sd    r3,b(r1)      ; moved down, and adjusted
            halt


So …

No stalls!

• just 5007 cycles, a substantial improvement

• CPI of 1.001

I suspect:

• we can’t do much better than that! (every instruction/cycle does something which has to be done)


Now, …

Let’s try the same thing:

• but with floating point numbers


Example: Floating Point

            .data               ; psched5.s
    one:    .double 1.0         ; one
    N:      .word 8000          ; N = 1000 iterations
    a:      .space 8000         ; 1000 64-bit floating point values

            .text
            l.d   f10,one(r0)   ; f10 = 1 (one)
            daddi r1,r0,0       ; r1 = i = 0
            ld    r2,N(r0)      ; r2 = N

    loop:   l.d   f3,a(r1)      ; f3 = a[i/8]
            add.d f3,f3,f10     ; f3 = f3 + one
            s.d   f3,a(r1)      ; a[i/8] = f3
            daddi r1,r1,8       ; i = i + 8
            bne   r1,r2,loop    ; repeat, unless i == N
            nop
            halt


Stalls?

Stalls:

• 5000 RAW stalls

• 1000 structural stalls

Overall: 11008 cycles.


As Before: Reorder Operations

            .data               ; psched6.s
    one:    .double 1.0         ; one
    N:      .word 8000          ; N = 1000 iterations
    b:      .word 0             ; Address of a, minus 8
    a:      .space 8000         ; 1000 64-bit floating point values

    loop:   l.d   f3,a(r1)      ; f3 = a[i/8]
            daddi r1,r1,8       ; i = i + 8
            add.d f3,f3,f10     ; f3 = f3 + one
            bne   r1,r2,loop    ; repeat, unless i == N
            s.d   f3,b(r1)      ; a[i/8] = f3
            halt


Stalls?

Stalls:

• 1000 RAW stalls

• 1000 structural stalls

Overall: 7008 cycles; better than 11008, previously.


Stalls?

Remaining stalls, both RAW and structural, are from:

    add.d f3,f3,f10     ; f3 = f3 + one
    ...
    s.d   f3,b(r1)      ; a[i] = f3

It takes four cycles for the add.d to move through the floating point adder.

The s.d (a read after write) arrives too soon, and is blocked for two cycles (per iteration).


So, …

How can we eliminate the remaining stalls?


Loop Unrolling

Originally:

    for (i=0; i<1000; i+=1)
        a[i] += 1

Unroll the loop:

    for (i=0; i<1000; i+=4)
    {
        a[i+0] += 1
        a[i+1] += 1
        a[i+2] += 1
        a[i+3] += 1
    }


Example

            .data               ; psched7.s
    one:    .double 1.0         ; one
    N:      .word 8000          ; N = 1000 iterations
    b:      .word 0             ; Address of a, minus 8
    a:      .space 8000         ; 1000 64-bit floating point values

    loop:   l.d   f3,a(r1)      ; 1
            daddi r1,r1,8
            add.d f3,f3,f10
            s.d   f3,b(r1)

            l.d   f4,a(r1)      ; 2 - now using f4, instead of f3
            daddi r1,r1,8
            add.d f4,f4,f10
            s.d   f4,b(r1)

            l.d   f5,a(r1)      ; 3 - now using f5, instead of f3
            daddi r1,r1,8
            add.d f5,f5,f10
            s.d   f5,b(r1)

            l.d   f6,a(r1)      ; 4 - now using f6, instead of f3
            daddi r1,r1,8
            add.d f6,f6,f10
            s.d   f6,b(r1)

            bne   r1,r2,loop    ; repeat, unless i == N
            nop
            halt


Now, …

That doesn’t help:

• but many of these operations are now independent

• they can be reordered

First:

• we don’t need all those daddi instructions

• and we can use the delay slot for something useful


Example

            .data               ; psched8.s
    one:    .double 1.0
    N:      .word 8000
    a:      .space 8000

            .text
            ld    r2,N(r0)
            l.d   f10,one(r0)
            daddi r11,r0,a
            dadd  r12,r11,r2

    loop:   l.d   f3,0(r11)
            add.d f3,f3,f10
            s.d   f3,0(r11)

            l.d   f4,8(r11)     ; adjust displacement
            add.d f4,f4,f10
            s.d   f4,8(r11)     ; adjust displacement

            l.d   f5,16(r11)    ; adjust displacement
            add.d f5,f5,f10
            s.d   f5,16(r11)    ; adjust displacement

            l.d   f6,24(r11)    ; adjust displacement
            add.d f6,f6,f10

            daddi r11,r11,32    ; collect all four daddi-s into one

            bne   r11,r12,loop
            s.d   f6,-8(r11)    ; 24 - 32 == -8, adjust displacement
            halt


Hmm, …

Still:

• 7009 cycles

• fewer instructions, more stalls, same number of cycles

We need to:

• look more carefully at how the pipeline is operating

• explore more options for reordering operations


Best we can do?

            .data               ; psched9.s
    one:    .double 1.0
    N:      .word 8000
    a:      .space 8000

            .text
            ld    r2,N(r0)
            l.d   f10,one(r0)
            daddi r11,r0,a
            dadd  r12,r11,r2

    loop:   l.d   f3,0(r11)     ; i%4==0 load
            l.d   f4,8(r11)     ; i%4==1 load

            add.d f3,f3,f10     ; i%4==0 add
            l.d   f5,16(r11)    ; i%4==2 load
            l.d   f6,24(r11)    ; i%4==3 load

            add.d f4,f4,f10     ; i%4==1 add
            s.d   f3,0(r11)     ; i%4==0 store
            daddi r11,r11,32    ; increment loop counter

            add.d f5,f5,f10     ; i%4==2 add
            add.d f6,f6,f10     ; i%4==3 add

            s.d   f4,-24(r11)   ; i%4==1 store (8-32 == -24)
            s.d   f5,-16(r11)   ; i%4==2 store (16-32 == -16)

            bne   r11,r12,loop
            s.d   f6,-8(r11)    ; i%4==3 store (24-32 == -8)
            halt


Differences …

Differences from the previous version:

• reorder independent instructions; preserve the order of dependent instructions

• increment the loop counter sooner and adjust subsequent offsets

• carefully interleave FP and non-FP instructions; allow non-FP instructions to flow around FP instructions in the FP adder
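The rule "reorder independent instructions, preserve the order of dependent instructions" can be checked mechanically. The Python sketch below is a simplified illustration: each instruction is a hypothetical (label, destination, sources) tuple, only register dependences (RAW, WAR, WAW) are considered, and memory accesses are assumed to touch different addresses.

```python
from itertools import combinations

def conflicts(a, b):
    """True if a (earlier) and b (later) have a RAW, WAR, or WAW register dependence."""
    _, dest_a, srcs_a = a
    _, dest_b, srcs_b = b
    raw = dest_a is not None and dest_a in srcs_b
    war = dest_b is not None and dest_b in srcs_a
    waw = dest_a is not None and dest_a == dest_b
    return raw or war or waw

def preserves_dependences(original, reordered):
    """Check that every dependent pair keeps its original relative order."""
    position = {instr[0]: k for k, instr in enumerate(reordered)}
    return all(position[a[0]] < position[b[0]]
               for a, b in combinations(original, 2) if conflicts(a, b))

# Two unrolled copies of load / add / store; stores have no destination register.
ld3 = ("l.d f3", "f3", ("r11",))
ad3 = ("add.d f3", "f3", ("f3", "f10"))
st3 = ("s.d f3", None, ("f3", "r11"))
ld4 = ("l.d f4", "f4", ("r11",))
ad4 = ("add.d f4", "f4", ("f4", "f10"))
st4 = ("s.d f4", None, ("f4", "r11"))

original = [ld3, ad3, st3, ld4, ad4, st4]
interleaved = [ld3, ld4, ad3, ad4, st3, st4]   # legal: the f3 and f4 chains are independent
broken = [ld3, st3, ad3, ld4, ad4, st4]        # illegal: store moved above its add

print(preserves_dependences(original, interleaved))  # True
print(preserves_dependences(original, broken))       # False
```

Moving the daddi on r11 would fail this check unless the displacements of the later instructions are adjusted, which is why the listing rewrites the offsets.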


Performance

Now:

• 4009 cycles, CPI of 1.144

• just 500 structural stalls

Previously:

• 11008 cycles, CPI of 1.833

• roughly 64% fewer cycles (a speedup of about 2.7x)
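These figures can be cross-checked with a few lines of Python. The instruction counts are an assumption inferred from the listings (setup instructions, loop body times iterations, plus the halt); only the cycle counts come from the slides.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions

def speedup(old_cycles, new_cycles):
    """How many times faster the new version is."""
    return old_cycles / new_cycles

# Assumed instruction counts, inferred from the listings:
unscheduled_instructions = 3 + 6 * 1000 + 1    # psched5: 6 per iteration, 1000 iterations
scheduled_instructions = 4 + 14 * 250 + 1      # psched9: 14 per iteration, 250 iterations

print(round(cpi(11008, unscheduled_instructions), 3))  # 1.833
print(round(cpi(4009, scheduled_instructions), 3))     # 1.144
print(round(speedup(11008, 4009), 2))                  # 2.75
print(round(1 - 4009 / 11008, 3))                      # 0.636
```

The last two lines show two ways of quoting the same improvement: the scheduled loop is about 2.75 times faster, or equivalently uses about 64% fewer cycles.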


More Generally

Unroll the loop some number of times (four, here):

    for (i=0; i<N-N%4; i+=4)
    {
        a[i+0] += 1
        a[i+1] += 1
        a[i+2] += 1
        a[i+3] += 1
    }

    // Perform any remaining iterations...
    for (i=N-N%4; i<N; i+=1)
        a[i] += 1
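As a quick sanity check of the general scheme, here is a runnable Python version of the same two loops, compared against the plain loop. The function names are illustrative; the point is only that the unrolled form and the remainder loop together touch every element exactly once, whatever N is.

```python
def increment_all(a):
    """Reference version: one element per iteration."""
    for i in range(len(a)):
        a[i] += 1

def increment_all_unrolled(a):
    """Unrolled four times, with a clean-up loop for the remaining N % 4 elements."""
    n = len(a)
    main = n - n % 4
    for i in range(0, main, 4):
        a[i + 0] += 1
        a[i + 1] += 1
        a[i + 2] += 1
        a[i + 3] += 1
    # Perform any remaining iterations.
    for i in range(main, n):
        a[i] += 1

plain = list(range(10))
unrolled = list(range(10))
increment_all(plain)
increment_all_unrolled(unrolled)
print(plain == unrolled)  # True
```

With N = 10, the main loop handles elements 0 to 7 and the clean-up loop handles elements 8 and 9.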


Costs?

Costs:

• increase in code size (so this is a space-time trade-off)

• requires more registers (a limited resource)

In fact:

• this type of pipeline scheduling is only possible because we have a large number of general-purpose registers


Done
