Post on 22-Apr-2018

CA226 — Advanced Computer Architecture

Stephen Blott <[email protected]>



Load RAW Stalls

    ld   r1,0(r2)
    dadd r4,r3,r1        ; unavoidable stall (on r1)

The value needed by the second instruction in Ex is available only after the first instruction has completed Mem.


Branch RAW Stalls

    dsub r1,r2,r3
    beqz r1,target

We can’t forward the necessary value until after Ex:

• hence, a stall of one cycle (whether the branch is taken or not)


A "Double Whammy" Stall

    ld   r1,0(r2)
    beqz r1,target

Stall of two cycles. This is a combination of the two previous stalls.
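The three RAW cases above can be reproduced with a very small model. This is only a sketch under stated assumptions: a classic five-stage pipeline (IF, ID, EX, MEM, WB) with forwarding, branches that need their operand at the start of ID, and a consumer that immediately follows its producer. The function name `raw_stalls` is hypothetical.

```python
# Stage positions in an assumed classic five-stage pipeline.
STAGES = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}


def raw_stalls(ready_after: str, needed_in: str) -> int:
    """Stall cycles when a consumer immediately follows its producer.

    ready_after: stage at the end of which the producer's result can be forwarded.
    needed_in:   stage at the start of which the consumer needs that value.
    """
    return max(0, STAGES[ready_after] - STAGES[needed_in])


load_then_alu = raw_stalls("MEM", "EX")     # ld r1,0(r2)   ; dadd r4,r3,r1
alu_then_branch = raw_stalls("EX", "ID")    # dsub r1,r2,r3 ; beqz r1,target
load_then_branch = raw_stalls("MEM", "ID")  # ld r1,0(r2)   ; beqz r1,target
alu_then_alu = raw_stalls("EX", "EX")       # forwarding hides this case entirely

print(load_then_alu, alu_then_branch, load_then_branch, alu_then_alu)  # prints: 1 1 2 0
```

The "double whammy" falls out as the sum of the other two: the value is ready one stage later (Mem rather than Ex) and needed one stage earlier (ID rather than Ex).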


The Pipeline

The pipeline:

• is essentially a miniature graph of parallel-processing elements

• instructions flow from node to node


The Pipeline (diagram not reproduced in this transcript)


Consider this …

    dadd r3,r1,r2
    dadd r4,r1,r2

They flow:

• smoothly through the pipeline, no stalls


Now consider this …

    dadd r3,r1,r2
    dmul r4,r1,r2

They flow:

• more slowly (more cycles), but still no stalls (multiplication is expensive)


And this …

    dmul r3,r1,r2
    dmul r4,r1,r2
    dmul r5,r1,r2

They flow:

• again, more slowly (more cycles), but still no stalls

• all three instructions flow through the multiplier


And this …

    dmul r4,r1,r2
    dadd r3,r1,r2

The dadd is not blocked by the dmul:

• the dadd overtakes the dmul in the pipeline

• still no stalls


Write-After-Write (WAW) Stalls

    dmul r3,r1,r2
    dadd r3,r1,r2

The dadd is now blocked by the dmul:

• were the dadd to overtake the dmul, r3 would have the incorrect final value

This is known as a:

• write-after-write (WAW) stall
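To see why the overtaking matters, here is a minimal sketch that applies register writes in completion order rather than program order. The latencies (7 cycles in the multiplier, 1 in the integer ALU), the register values and the function name `final_register_values` are assumptions made for illustration only.

```python
def final_register_values(instrs, enforce_waw):
    """Apply register writes in the order in which instructions complete.

    instrs: list of (dest, ex_cycles, value) in program order, one issued per cycle.
    enforce_waw: if True, a later writer of a register waits for the earlier one.
    """
    writes = []        # (write_cycle, program_index, dest, value)
    last_write = {}    # dest -> cycle of the most recently scheduled write
    for index, (dest, ex_cycles, value) in enumerate(instrs):
        write_cycle = index + ex_cycles          # issue cycle plus execution time
        if enforce_waw and dest in last_write:
            # Stall until the earlier write to the same register has happened.
            write_cycle = max(write_cycle, last_write[dest] + 1)
        last_write[dest] = write_cycle
        writes.append((write_cycle, index, dest, value))

    regs = {}
    for _, _, dest, value in sorted(writes):     # earliest write first
        regs[dest] = value
    return regs


r1, r2 = 6, 7
program = [
    ("r3", 7, r1 * r2),   # dmul r3,r1,r2  (assumed 7-cycle multiplier)
    ("r3", 1, r1 + r2),   # dadd r3,r1,r2  (assumed 1-cycle ALU)
]

overtaking = final_register_values(program, enforce_waw=False)
stalled = final_register_values(program, enforce_waw=True)
print(overtaking["r3"], stalled["r3"])  # prints: 42 13
```

Program order says r3 should end up holding the sum (13). Without the stall the slower dmul writes last and leaves the product (42) behind.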


Write-After-Write (WAW) Stalls

    dmul  r3,r1,r2
    dadd  r3,r1,r2
    daddi r5,r0,100

Note:

• subsequent, independent instructions are also blocked!


Another topic…


Example

Consider:

    for (i=0; i<1000; i+=1)
        a[i] += 1;              // where a[i] is an integer


Example

        .data                   ; psched3.s
N:      .word  8000             ; N = 1000 iterations
a:      .space 8000             ; 1000 64-bit words

        .text
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   ld    r3,a(r1)          ; r3 = a[i/8]
        daddi r3,r3,1           ; r3 = r3 + 1
        sd    r3,a(r1)          ; a[i/8] = r3
        daddi r1,r1,8           ; i = i + 8
        bne   r1,r2,loop        ; repeat, unless i == N
        nop
        halt


So …

Stalls per iteration:

• one load RAW stall on r3

• one branch RAW stall on r1

• plus 999 wasted nop cycles (in delay slot)

8007 cycles in total


After pipeline scheduling, …

        .data                   ; psched4.s
N:      .word  8000
b:      .word  0                ; Address of a, minus 8 (hack)
a:      .space 8000

        .text
        daddi r1,r0,0
        ld    r2,N(r0)

loop:   ld    r3,a(r1)
        daddi r1,r1,8           ; moved up
        daddi r3,r3,1
        bne   r1,r2,loop
        sd    r3,b(r1)          ; moved down, and adjusted
        halt
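The reordering is only valid if it leaves the result unchanged. The sketch below models both instruction orders in Python and checks that they agree. It assumes that memory can be modelled as a list with one slot per 64-bit word and that the instruction after bne executes whether or not the branch is taken (the delay slot); the function names are hypothetical.

```python
def run_original(values):
    """Instruction order of psched3.s (the nop is omitted, it has no effect)."""
    mem = list(values)         # one list slot per 64-bit word; must be non-empty
    n = 8 * len(mem)           # N, in bytes
    r1 = 0
    while True:
        r3 = mem[r1 // 8]      # ld    r3,a(r1)
        r3 += 1                # daddi r3,r3,1
        mem[r1 // 8] = r3      # sd    r3,a(r1)
        r1 += 8                # daddi r1,r1,8
        if r1 == n:            # bne   r1,r2,loop
            break
    return mem


def run_scheduled(values):
    """Instruction order of psched4.s, with the store in the branch delay slot."""
    mem = list(values)
    n = 8 * len(mem)
    r1 = 0
    while True:
        r3 = mem[r1 // 8]          # ld    r3,a(r1)
        r1 += 8                    # daddi r1,r1,8   (moved up)
        r3 += 1                    # daddi r3,r3,1
        taken = r1 != n            # bne   r1,r2,loop
        mem[(r1 - 8) // 8] = r3    # sd    r3,b(r1)  (b is a minus 8 bytes)
        if not taken:
            break
    return mem


data = list(range(10))
original_result = run_original(data)
scheduled_result = run_scheduled(data)
```

Because r1 has already been advanced by 8 when the store runs, addressing through b (eight bytes below a) lands on the element that was just loaded.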


So …

No stalls!

• just 5007 cycles, a substantial improvement

• CPI of 1.001

I suspect:

• we can’t do much better than that! (every instruction/cycle does something which has to be done)
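The cycle counts for the two integer versions can be reproduced with simple bookkeeping. The breakdown below is an assumption chosen because it matches the quoted totals, not something read off the simulator: two set-up instructions, one halt, four cycles to fill a five-stage pipeline, and the nop counted as an instruction in psched3.s.

```python
ITERATIONS = 1000
SETUP_INSTRS = 2     # daddi and ld before the loop
HALT_INSTRS = 1
PIPELINE_FILL = 4    # assumed: cycles before the first instruction can complete


def totals(instrs_per_iter, stalls_per_iter):
    """Return (instructions, cycles) for the whole program under the assumed model."""
    instructions = SETUP_INSTRS + ITERATIONS * instrs_per_iter + HALT_INSTRS
    cycles = instructions + ITERATIONS * stalls_per_iter + PIPELINE_FILL
    return instructions, cycles


unscheduled_instrs, unscheduled_cycles = totals(instrs_per_iter=6, stalls_per_iter=2)  # psched3.s
scheduled_instrs, scheduled_cycles = totals(instrs_per_iter=5, stalls_per_iter=0)      # psched4.s
scheduled_cpi = scheduled_cycles / scheduled_instrs

print(unscheduled_cycles, scheduled_cycles, round(scheduled_cpi, 3))  # prints: 8007 5007 1.001
```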


Now, …

Let’s try the same thing:

• but with floating point numbers


Example — Floating Point

        .data                   ; psched5.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 1000 iterations
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d   f10,one(r0)       ; f10 = 1 (one)
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   l.d   f3,a(r1)          ; f3 = a[i/8]
        add.d f3,f3,f10         ; f3 = f3 + one
        s.d   f3,a(r1)          ; a[i/8] = f3
        daddi r1,r1,8           ; i = i + 8
        bne   r1,r2,loop        ; repeat, unless i == N
        nop
        halt


Stalls?

Stalls:

• 5000 RAW stalls

• 1000 structural stalls

Overall: 11008 cycles.


As Before: Reorder Operations

        .data                   ; psched6.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 1000 iterations
b:      .word   0               ; Address of a, minus 8
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d   f10,one(r0)       ; f10 = 1 (one)
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   l.d   f3,a(r1)          ; f3 = a[i/8]
        daddi r1,r1,8           ; i = i + 8
        add.d f3,f3,f10         ; f3 = f3 + one
        bne   r1,r2,loop        ; repeat, unless i == N
        s.d   f3,b(r1)          ; a[i/8 - 1] = f3 (i has already been incremented)
        halt


Stalls?

Stalls:

• 1000 RAW stalls

• 1000 structural stalls

Overall: 7008 cycles; better than 11008, previously.


Stalls?

The remaining stalls, both RAW and structural, come from:

    add.d f3,f3,f10     ; f3 = f3 + one
    ...
    s.d   f3,b(r1)      ; a[i] = f3

It takes four cycles for the add.d to move through the floating point adder.

The s.d (a read after write) arrives too soon, and is blocked for two cycles (per iteration).


So, …

How can we eliminate the remaining stalls?


Loop Unrolling

Originally:

    for (i=0; i<1000; i+=1)
        a[i] += 1

Unroll the loop:

    for (i=0; i<1000; i+=4)
    {
        a[i+0] += 1
        a[i+1] += 1
        a[i+2] += 1
        a[i+3] += 1
    }


Example

        .data                   ; psched7.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 1000 iterations
b:      .word   0               ; Address of a, minus 8
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d   f10,one(r0)       ; f10 = 1 (one)
        daddi r1,r0,0           ; r1 = i = 0
        ld    r2,N(r0)          ; r2 = N

loop:   l.d   f3,a(r1)          ; 1
        daddi r1,r1,8
        add.d f3,f3,f10
        s.d   f3,b(r1)

        l.d   f4,a(r1)          ; 2 - now using f4, instead of f3
        daddi r1,r1,8
        add.d f4,f4,f10
        s.d   f4,b(r1)

        l.d   f5,a(r1)          ; 3 - now using f5, instead of f3
        daddi r1,r1,8
        add.d f5,f5,f10
        s.d   f5,b(r1)

        l.d   f6,a(r1)          ; 4 - now using f6, instead of f3
        daddi r1,r1,8
        add.d f6,f6,f10
        s.d   f6,b(r1)

        bne   r1,r2,loop        ; repeat, unless i == N
        nop
        halt


Now, …

That doesn’t help:

• but many of these operations are now independent

• they can be reordered

First:

• we don’t need all those daddi instructions

• and we can use the delay slot for something useful


Example

        .data                   ; psched8.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld    r2,N(r0)
        l.d   f10,one(r0)
        daddi r11,r0,a
        dadd  r12,r11,r2

loop:   l.d   f3,0(r11)
        add.d f3,f3,f10
        s.d   f3,0(r11)

        l.d   f4,8(r11)         ; adjust displacement
        add.d f4,f4,f10
        s.d   f4,8(r11)         ; adjust displacement

        l.d   f5,16(r11)        ; adjust displacement
        add.d f5,f5,f10
        s.d   f5,16(r11)        ; adjust displacement

        l.d   f6,24(r11)        ; adjust displacement
        add.d f6,f6,f10

        daddi r11,r11,32        ; collect all four daddi-s into one

        bne   r11,r12,loop
        s.d   f6,-8(r11)        ; 24 - 32 == -8, adjust displacement
        halt


Hmm, …

Still:

• 7009 cycles

• fewer instructions, more stalls, same number of cycles

We need to:

• look more carefully at how the pipeline is operating

• explore more options for reordering operations


Best we can do?

        .data                   ; psched9.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld    r2,N(r0)
        l.d   f10,one(r0)
        daddi r11,r0,a
        dadd  r12,r11,r2

loop:   l.d   f3,0(r11)         ; i%4==0 load
        l.d   f4,8(r11)         ; i%4==1 load

        add.d f3,f3,f10         ; i%4==0 add
        l.d   f5,16(r11)        ; i%4==2 load
        l.d   f6,24(r11)        ; i%4==3 load

        add.d f4,f4,f10         ; i%4==1 add
        s.d   f3,0(r11)         ; i%4==0 store
        daddi r11,r11,32        ; increment loop counter

        add.d f5,f5,f10         ; i%4==2 add
        add.d f6,f6,f10         ; i%4==3 add

        s.d   f4,-24(r11)       ; i%4==1 store (8-32 == -24)
        s.d   f5,-16(r11)       ; i%4==2 store (16-32 == -16)

        bne   r11,r12,loop
        s.d   f6,-8(r11)        ; i%4==3 store (24-32 == -8)
        halt
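With this much reordering it is easy to get a displacement wrong, so it is worth checking the bookkeeping. The sketch below replays the loop body of psched9.s in the same order against a Python list. It assumes that r11 can be modelled as a byte offset from the start of a, that the array length is a positive multiple of four, and that the instruction after bne always executes; the function name is hypothetical.

```python
def run_psched9(values, one=1.0):
    """Replay the unrolled, interleaved loop body in its scheduled order."""
    mem = list(values)        # one slot per 64-bit value; length a positive multiple of 4
    r11 = 0                   # byte offset from the start of a
    r12 = 8 * len(mem)        # byte offset just past the end of a
    while True:
        f3 = mem[(r11 + 0) // 8]       # l.d   f3,0(r11)
        f4 = mem[(r11 + 8) // 8]       # l.d   f4,8(r11)
        f3 = f3 + one                  # add.d f3,f3,f10
        f5 = mem[(r11 + 16) // 8]      # l.d   f5,16(r11)
        f6 = mem[(r11 + 24) // 8]      # l.d   f6,24(r11)
        f4 = f4 + one                  # add.d f4,f4,f10
        mem[(r11 + 0) // 8] = f3       # s.d   f3,0(r11)
        r11 += 32                      # daddi r11,r11,32
        f5 = f5 + one                  # add.d f5,f5,f10
        f6 = f6 + one                  # add.d f6,f6,f10
        mem[(r11 - 24) // 8] = f4      # s.d   f4,-24(r11)
        mem[(r11 - 16) // 8] = f5      # s.d   f5,-16(r11)
        taken = r11 != r12             # bne   r11,r12,loop
        mem[(r11 - 8) // 8] = f6       # s.d   f6,-8(r11)  (delay slot)
        if not taken:
            break
    return mem


data = [float(i) for i in range(12)]
result = run_psched9(data)
```

Every element is loaded before r11 is advanced and stored either before the increment (displacement 0) or after it (displacements -24, -16 and -8), so each of the four elements in a group is written exactly once.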


Differences …

Differences to previous version:

• reorder independent instructions (but preserve the order of dependent instructions)

• increment the loop counter sooner and adjust subsequent offsets

• carefully interleave FP and non-FP instructions (allow non-FP instructions to flow around FP instructions in the FP adder)


Performance

Now:

• 4009 cycles, CPI of 1.144

• just 500 structural stalls

Previously:

• 11008 cycles, CPI of 1.833

• a speedup of roughly 2.7x (about 64% fewer cycles)
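The comparison can be checked directly from the two cycle counts. The instruction counts used for the CPI figures are assumptions (set-up instructions, loop body times iteration count, plus halt) that happen to reproduce the quoted values.

```python
before_cycles = 11008    # psched5.s, as quoted
after_cycles = 4009      # psched9.s, as quoted

speedup = before_cycles / after_cycles          # ratio of execution times
reduction = 1 - after_cycles / before_cycles    # fraction of cycles eliminated

# Assumed instruction counts: set-up + loop body * iterations + halt.
before_instrs = 3 + 6 * 1000 + 1     # 6-instruction body (nop included), 1000 iterations
after_instrs = 4 + 14 * 250 + 1      # 14-instruction unrolled body, 250 iterations

before_cpi = before_cycles / before_instrs
after_cpi = after_cycles / after_instrs

print(f"{speedup:.2f}x faster, {reduction:.1%} fewer cycles")
print(f"CPI before {before_cpi:.3f}, after {after_cpi:.3f}")
```

Note that CPI understates the gain: unrolling also removes instructions (loop overhead and the nop), so the cycle count falls further than the CPI alone suggests.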


More Generally

Unroll the loop some number of times (four, here):

    for (i=0; i<N-N%4; i+=4)
    {
        a[i+0] += 1
        a[i+1] += 1
        a[i+2] += 1
        a[i+3] += 1
    }

    // Perform any remaining iterations...
    for (i=N-N%4; i<N; i+=1)
        a[i] += 1
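A runnable version of the same pattern, as a minimal sketch in Python (the function name `increment_all` is hypothetical): the main loop handles groups of four and a clean-up loop handles the N%4 left-over elements.

```python
def increment_all(a):
    """Add 1 to every element of a, in place, using a loop unrolled four times."""
    n = len(a)
    main_end = n - n % 4            # largest multiple of 4 that is <= n

    # Unrolled main loop: four independent updates per iteration.
    for i in range(0, main_end, 4):
        a[i + 0] += 1
        a[i + 1] += 1
        a[i + 2] += 1
        a[i + 3] += 1

    # Clean-up loop: the remaining 0 to 3 elements.
    for i in range(main_end, n):
        a[i] += 1
    return a


results = {n: increment_all(list(range(n))) for n in range(10)}
```

Array lengths 0 to 9 exercise every possible remainder, including the case where the unrolled loop does not run at all.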


Costs?

Costs:

• increase in code size (so this is a space-time trade-off)

• requires more registers (a limited resource)

In fact:

• this type of pipeline scheduling is only possible because we have a large number of general-purpose registers


Done