CA226 — Advanced Computer Architecture
Load RAW Stalls
    ld   r1,0(r2)
    dadd r4,r3,r1     ; unavoidable stall (on r1)

The value needed by the second instruction in Ex is available only after the first instruction has completed Mem.
Branch RAW Stalls
    dsub r1,r2,r3
    beqz r1,target

We can’t forward the necessary value until after Ex:
• hence, a stall of one cycle (whether the branch is taken or not)
A "Double Whammy" Stall
    ld   r1,0(r2)
    beqz r1,target

Stall of two cycles. This is a combination of the two previous stalls.
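The three cases above can be summarised in a small lookup. This is only a sketch: it assumes a five-stage pipeline with forwarding, branches resolved in the decode stage, and a consumer that immediately follows its producer. The names `kind` and `raw_stall_cycles` are hypothetical and do not come from the slides.

```c
/* Instruction classes used in the three stall examples above. */
enum kind { ALU, LOAD, BRANCH };

/*
 * Stall cycles when `consumer` immediately follows `producer` and reads
 * the register that `producer` writes (assumed model, see lead-in).
 */
int raw_stall_cycles(enum kind producer, enum kind consumer)
{
    int stalls = 0;

    if (producer == BRANCH)
        return 0;          /* a branch writes no register */
    if (producer == LOAD)
        stalls += 1;       /* loaded value is available only after Mem */
    if (consumer == BRANCH)
        stalls += 1;       /* branch needs its operand one stage earlier */
    return stalls;
}
```

Under these assumptions, `raw_stall_cycles(LOAD, BRANCH)` gives the two-cycle "double whammy", the sum of the load stall and the branch stall.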
The Pipeline
The pipeline:
• is essentially a miniature graph of parallel-processing elements
• instructions flow from node to node
The Pipeline
Consider this …
    dadd r3,r1,r2
    dadd r4,r1,r2
They flow:
• smoothly through the pipeline, no stalls
Now consider this …
    dadd r3,r1,r2
    dmul r4,r1,r2
They flow:
• more slowly (more cycles), but still no stalls (multiplication is expensive)
And this …
    dmul r3,r1,r2
    dmul r4,r1,r2
    dmul r5,r1,r2
They flow:
• again, more slowly (more cycles), but still no stalls
• all three instructions flow through the multiplier
And this …
    dmul r4,r1,r2
    dadd r3,r1,r2
The dadd is not blocked by the dmul:
• the dadd overtakes the dmul in the pipeline
• still no stalls
Write-After-Write (WAW) Stalls
    dmul r3,r1,r2
    dadd r3,r1,r2
The dadd is now blocked by the dmul:
• were the dadd to overtake the dmul, the dmul would write r3 last, and r3 would have the incorrect final value
This is known as a:
• write-after-write (WAW) stall
Write-After-Write (WAW) Stalls
    dmul  r3,r1,r2
    dadd  r3,r1,r2
    daddi r5,r0,100
Note:
• subsequent, independent instructions are also blocked!
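The condition that forces a WAW stall can be sketched as a simple check. The helper `waw_hazard` and its parameters are hypothetical, and the latencies in the usage note are assumed figures rather than values taken from the slides.

```c
#include <stdbool.h>

/*
 * Returns true if the later instruction, left unchecked, would write the
 * shared destination register before (or in the same cycle as) the earlier
 * instruction.
 *
 * ex_cycles_*  : cycles each instruction spends in its execute unit
 * issue_gap    : cycles between issuing the two instructions (1 if adjacent)
 */
bool waw_hazard(int dest_first, int ex_cycles_first,
                int dest_second, int ex_cycles_second,
                int issue_gap)
{
    if (dest_first != dest_second)
        return false;      /* different registers: overtaking is harmless */
    return issue_gap + ex_cycles_second <= ex_cycles_first;
}
```

With an assumed 7-cycle multiplier and 1-cycle integer adder, `waw_hazard(3, 7, 3, 1, 1)` is true for the dmul/dadd pair writing r3, while the earlier dmul r4 / dadd r3 pair gives false because the destinations differ.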
Another topic…
Example
Consider:
    for (i=0; i<1000; i+=1)
        a[i] += 1;    // where a[i] is an integer
Example
        .data                   ; psched3.s
N:      .word  8000             ; N = 8000 bytes = 1000 iterations
a:      .space 8000             ; 1000 64-bit words

        .text
        daddi  r1,r0,0          ; r1 = i = 0
        ld     r2,N(r0)         ; r2 = N

loop:   ld     r3,a(r1)         ; r3 = a[i/8]
        daddi  r3,r3,1          ; r3 = r3 + 1
        sd     r3,a(r1)         ; a[i/8] = r3
        daddi  r1,r1,8          ; i = i + 8
        bne    r1,r2,loop       ; repeat, unless i == N
        nop                     ; branch delay slot
        halt
So …
Stalls per iteration:
• one load RAW stall on r3
• one branch RAW stall on r1
• plus 999 wasted nop cycles (in delay slot)
8007 cycles in total
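The total can be reproduced with a short calculation. This sketch assumes a five-stage pipeline (so filling it costs four extra cycles), counts the nop as an ordinary instruction, and takes two stall cycles per iteration from the list above. The function names are hypothetical.

```c
/*
 * Cycles for a run on a pipeline with `stages` stages: the first
 * instruction needs `stages` cycles, each later instruction completes one
 * cycle after its predecessor, and every stall adds one cycle.
 */
long total_cycles(long instructions, long stall_cycles, int stages)
{
    return instructions + stall_cycles + (stages - 1);
}

/*
 * Instructions executed by psched3.s: 2 set-up instructions, 6 per
 * iteration (including the nop), and the final halt.
 */
long psched3_instructions(long iterations)
{
    return 2 + 6 * iterations + 1;
}
```

For 1000 iterations this gives 6003 instructions, and `total_cycles(6003, 2000, 5)` is 8007.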
After pipeline scheduling, …
        .data                   ; psched4.s
N:      .word  8000
b:      .word  0                ; Address of a, minus 8 (hack)
a:      .space 8000

        .text
        daddi  r1,r0,0
        ld     r2,N(r0)

loop:   ld     r3,a(r1)
        daddi  r1,r1,8          ; moved up
        daddi  r3,r3,1
        bne    r1,r2,loop
        sd     r3,b(r1)         ; moved down (into the delay slot), and adjusted
        halt
So …
No stalls!
• just 5007 cycles, a substantial improvement
• CPI of 1.001
I suspect:
• we can’t do much better than that! (every instruction/cycle does something which has to be done)
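The CPI figure follows from dividing cycles by instructions executed. In this sketch the instruction count for psched4.s (2 set-up, 5 per iteration, 1 halt) is counted from the listing rather than quoted from the slides, and the function names are hypothetical.

```c
/* Cycles per instruction. */
double cpi(long cycles, long instructions)
{
    return (double)cycles / (double)instructions;
}

/* Instructions executed by psched4.s for a given number of iterations. */
long psched4_instructions(long iterations)
{
    return 2 + 5 * iterations + 1;
}
```

With 1000 iterations, `cpi(5007, psched4_instructions(1000))` is 5007/5003, which rounds to 1.001. The only cycles beyond one per instruction are the four needed to fill the pipeline.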
Now, …
Let’s try the same thing:
• but with floating point numbers
Example — Floating Point
        .data                   ; psched5.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 8000 bytes = 1000 iterations
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d     f10,one(r0)     ; f10 = 1.0 (one)
        daddi   r1,r0,0         ; r1 = i = 0
        ld      r2,N(r0)        ; r2 = N

loop:   l.d     f3,a(r1)        ; f3 = a[i/8]
        add.d   f3,f3,f10       ; f3 = f3 + one
        s.d     f3,a(r1)        ; a[i/8] = f3
        daddi   r1,r1,8         ; i = i + 8
        bne     r1,r2,loop      ; repeat, unless i == N
        nop                     ; branch delay slot
        halt
Stalls?
Stalls:
• 5000 RAW stalls
• 1000 structural stalls
Overall: 11008 cycles.
As Before: Reorder Operations
        .data                   ; psched6.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 8000 bytes = 1000 iterations
b:      .word   0               ; Address of a, minus 8
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d     f10,one(r0)     ; f10 = 1.0 (one)
        daddi   r1,r0,0         ; r1 = i = 0
        ld      r2,N(r0)        ; r2 = N

loop:   l.d     f3,a(r1)        ; f3 = a[i/8]
        daddi   r1,r1,8         ; i = i + 8
        add.d   f3,f3,f10       ; f3 = f3 + one
        bne     r1,r2,loop      ; repeat, unless i == N
        s.d     f3,b(r1)        ; a[(i-8)/8] = f3, the element just loaded
        halt
Stalls?
Stalls:
• 1000 RAW stalls
• 1000 structural stalls
Overall: 7008 cycles; better than 11008, previously.
Stalls?
Remaining stalls, both RAW and structural, are from:
    add.d f3,f3,f10    ; f3 = f3 + one
    ...
    s.d   f3,b(r1)     ; a[i] = f3

It takes four cycles for the add.d to move through the floating point adder.
The s.d (a read after write) arrives too soon, and is blocked for two cycles (per iteration).
So, …
How can we eliminate the remaining stalls?
Loop Unrolling
Originally:
    for (i=0; i<1000; i+=1)
        a[i] += 1;
Unroll the loop:
    for (i=0; i<1000; i+=4) {
        a[i+0] += 1;
        a[i+1] += 1;
        a[i+2] += 1;
        a[i+3] += 1;
    }
Example
        .data                   ; psched7.s
one:    .double 1.0             ; one
N:      .word   8000            ; N = 8000 bytes = 1000 iterations
b:      .word   0               ; Address of a, minus 8
a:      .space  8000            ; 1000 64-bit floating point values

        .text
        l.d     f10,one(r0)     ; f10 = 1.0 (one)
        daddi   r1,r0,0         ; r1 = i = 0
        ld      r2,N(r0)        ; r2 = N

loop:   l.d     f3,a(r1)        ; 1
        daddi   r1,r1,8
        add.d   f3,f3,f10
        s.d     f3,b(r1)

        l.d     f4,a(r1)        ; 2 - now using f4, instead of f3
        daddi   r1,r1,8
        add.d   f4,f4,f10
        s.d     f4,b(r1)

        l.d     f5,a(r1)        ; 3 - now using f5, instead of f3
        daddi   r1,r1,8
        add.d   f5,f5,f10
        s.d     f5,b(r1)

        l.d     f6,a(r1)        ; 4 - now using f6, instead of f3
        daddi   r1,r1,8
        add.d   f6,f6,f10
        s.d     f6,b(r1)

        bne     r1,r2,loop      ; repeat, unless i == N
        nop                     ; branch delay slot
        halt
Now, …
That doesn’t help:
• but many of these operations are now independent
• they can be reordered
First:
• we don’t need all those daddi instructions
• and we can use the delay slot for something useful
Example
        .data                   ; psched8.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld      r2,N(r0)
        l.d     f10,one(r0)
        daddi   r11,r0,a
        dadd    r12,r11,r2

loop:   l.d     f3,0(r11)
        add.d   f3,f3,f10
        s.d     f3,0(r11)

        l.d     f4,8(r11)       ; adjust displacement
        add.d   f4,f4,f10
        s.d     f4,8(r11)       ; adjust displacement

        l.d     f5,16(r11)      ; adjust displacement
        add.d   f5,f5,f10
        s.d     f5,16(r11)      ; adjust displacement

        l.d     f6,24(r11)      ; adjust displacement
        add.d   f6,f6,f10

        daddi   r11,r11,32      ; collect all four daddi-s into one

        bne     r11,r12,loop
        s.d     f6,-8(r11)      ; 24 - 32 == -8, adjust displacement
        halt
Hmm, …
Still:
• 7009 cycles
• fewer instructions, more stalls, same number of cycles
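The "fewer instructions, more stalls" observation can be checked by working backwards from the cycle counts. The instruction counts used here are counted from the listings, not quoted from the slides: 3 + 5×1000 + 1 = 5004 for psched6.s and 4 + 14×250 + 1 = 3505 for psched8.s. A five-stage pipeline is assumed, and `stall_cycles` is a hypothetical name.

```c
/*
 * Stall cycles implied by a measured cycle count: whatever is left after
 * one cycle per instruction and the (stages - 1) cycles needed to fill
 * the pipeline.
 */
long stall_cycles(long cycles, long instructions, int stages)
{
    return cycles - instructions - (stages - 1);
}
```

Under these assumptions psched6.s loses 2000 cycles to stalls and psched8.s loses 3500, so the 1499 instructions saved by unrolling are almost exactly cancelled by the additional stalls.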
We need to:
• look more carefully at how the pipeline is operating
• explore more options for reordering operations
Best we can do?
        .data                   ; psched9.s
one:    .double 1.0
N:      .word   8000
a:      .space  8000

        .text
        ld      r2,N(r0)
        l.d     f10,one(r0)
        daddi   r11,r0,a
        dadd    r12,r11,r2

loop:   l.d     f3,0(r11)       ; i%4==0 load
        l.d     f4,8(r11)       ; i%4==1 load

        add.d   f3,f3,f10       ; i%4==0 add
        l.d     f5,16(r11)      ; i%4==2 load
        l.d     f6,24(r11)      ; i%4==3 load

        add.d   f4,f4,f10       ; i%4==1 add
        s.d     f3,0(r11)       ; i%4==0 store
        daddi   r11,r11,32      ; increment loop counter

        add.d   f5,f5,f10       ; i%4==2 add
        add.d   f6,f6,f10       ; i%4==3 add

        s.d     f4,-24(r11)     ; i%4==1 store (8-32 == -24)
        s.d     f5,-16(r11)     ; i%4==2 store (16-32 == -16)

        bne     r11,r12,loop
        s.d     f6,-8(r11)      ; i%4==3 store (24-32 == -8)
        halt
Differences …
Differences to previous version:
• reorder independent instructions (but preserve the order of dependent instructions)
• increment the loop counter sooner and adjust subsequent offsets
• carefully interleave FP and non-FP instructions (allow non-FP instructions to flow around FP instructions in the FP adder)
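The offset adjustment can be illustrated in C with pointer arithmetic. This sketch mirrors the order of the stores in psched9.s, assumes the element count is a multiple of four, and uses a hypothetical function name. It shows the indexing only: a C compiler is free to reorder these operations itself.

```c
#include <stddef.h>

/*
 * Adds 1.0 to each of n doubles, four per pass (n must be a multiple of 4).
 * The pointer is advanced before the last three stores, so those stores
 * use negative indices, just as the assembly uses negative displacements.
 */
void add_one_scheduled(double *a, size_t n)
{
    double *p = a;
    double *end = a + n;

    while (p != end) {
        double x0 = p[0], x1 = p[1], x2 = p[2], x3 = p[3];

        p[0] = x0 + 1.0;    /* stored before the increment: offset 0 */
        p += 4;             /* 4 doubles = 32 bytes, like daddi r11,r11,32 */
        p[-3] = x1 + 1.0;   /* byte offset  8 - 32 = -24 */
        p[-2] = x2 + 1.0;   /* byte offset 16 - 32 = -16 */
        p[-1] = x3 + 1.0;   /* byte offset 24 - 32 =  -8 */
    }
}
```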
Performance
Now:
• 4009 cycles, CPI of 1.144
• just 500 structural stalls
Previously:
• 11008 cycles, CPI of 1.833
• so the cycle count falls by about 64% (a speedup of roughly 2.75x)
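The comparison can be checked directly from the two cycle counts. The helper names below are hypothetical. They compute the two usual ways of stating the improvement: the ratio of cycle counts, and the percentage of cycles saved.

```c
/* How many times faster the new version is. */
double speedup(long cycles_before, long cycles_after)
{
    return (double)cycles_before / (double)cycles_after;
}

/* Percentage of the original cycles that are no longer needed. */
double cycle_reduction_percent(long cycles_before, long cycles_after)
{
    return 100.0 * (double)(cycles_before - cycles_after)
                 / (double)cycles_before;
}
```

For 11008 and 4009 cycles the ratio is about 2.75 and the reduction is just under 64%.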
More Generally
Unroll the loop some number of times (four, here):
    for (i=0; i<N-N%4; i+=4) {
        a[i+0] += 1;
        a[i+1] += 1;
        a[i+2] += 1;
        a[i+3] += 1;
    }

    // Perform any remaining iterations...
    for (i=N-N%4; i<N; i+=1)
        a[i] += 1;
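As a compilable version of the same idea, the two loops can be wrapped in a function. This is a minimal sketch: the function name and the use of `long` elements are assumptions made for illustration.

```c
#include <stddef.h>

/* Adds 1 to every element of a[0..n-1], unrolled four times. */
void increment_all(long *a, size_t n)
{
    size_t i;
    size_t limit = n - n % 4;   /* largest multiple of 4 not exceeding n */

    /* Main unrolled loop: four independent updates per pass. */
    for (i = 0; i < limit; i += 4) {
        a[i + 0] += 1;
        a[i + 1] += 1;
        a[i + 2] += 1;
        a[i + 3] += 1;
    }

    /* Perform any remaining iterations (at most three). */
    for (; i < n; i += 1)
        a[i] += 1;
}
```

The clean-up loop keeps the function correct when n is not a multiple of four, at the cost of a little extra code.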
Costs?
Costs:
• increase in code size (so this is a space-time trade-off)
• requires more registers (a limited resource)
In fact:
• this type of pipeline scheduling is only possible because we have a large number of general-purpose registers
Done