Lecture 9
Thread Divergence Scheduling
Instruction Level Parallelism
Announcements
• Tuesday's lecture on 2/11 will be moved to room 4140, 6:30 PM to 7:50 PM
• Office hours cancelled on Thursday; make an appointment to see me later this week
• CUDA profiling with nvprof: see slides 13-16 of http://www.nvidia.com/docs/IO/116711/sc11-perf-optimization.pdf
Scott B. Baden /CSE 260/ Winter 2014 2
Recapping optimized matrix multiply
• Volkov and Demmel, SC08
• Improve performance using fewer threads
  - Reducing concurrency frees up registers to trade locality against parallelism
  - ILP to increase processor utilization
Hiding memory latency
• Parallelism = latency × throughput
  Arithmetic: 576 ops in flight/SM = 18 CP latency × 32 ops/SM/CP
  Memory: 150 KB in flight = ~500 CP latency (1100 ns) × 150 GB/sec
• How can we keep 150 KB in flight?
  - Multiple threads: ~35,000 threads @ 4 B/thread
  - ILP (increase fetches per thread)
  - Larger fetches (64 or 128 bits/thread)
  - Higher occupancy

```cuda
// Copy 1 float per thread: need 100% occupancy
int indx = threadIdx.x + block * blockDim.x;
float a0 = src[indx];
dest[indx] = a0;

// Copy 2 floats per thread: need 50% occupancy
float a0 = src[indx];
float a1 = src[indx + blockDim.x];
dest[indx] = a0;
dest[indx + blockDim.x] = a1;

// Copy 4 floats per thread: need 25% occupancy
int indx = threadIdx.x + 4 * block * blockDim.x;
float a[4];  // in registers
for (i = 0; i < 4; i++) a[i] = src[indx + i * blockDim.x];
for (i = 0; i < 4; i++) dest[indx + i * blockDim.x] = a[i];
```
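The slide's numbers can be checked directly with Little's law (work in flight = latency × throughput). The sketch below uses the slide's own figures; the tolerance bands are my own, since 1100 ns × 150 GB/sec comes to roughly 165 KB rather than a round 150 KB.

```python
# Little's law sanity check on the slide's figures:
# parallelism (work in flight) = latency x throughput.
arith_latency_cp = 18              # arithmetic latency, cycles
arith_tput = 32                    # ops per SM per cycle
assert arith_latency_cp * arith_tput == 576   # ops in flight per SM

mem_latency_s = 1100e-9            # ~500 clocks, per the slide
mem_tput_Bps = 150e9               # 150 GB/sec
bytes_in_flight = mem_latency_s * mem_tput_Bps
assert 150e3 <= bytes_in_flight <= 170e3      # ~150-165 KB in flight

# At 4 bytes fetched per thread, how many threads cover that?
threads = bytes_in_flight / 4
assert 35_000 <= threads <= 45_000            # ~35,000+ threads
```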
Today's lecture
• Thread divergence
• Scheduling
• Instruction Level Parallelism
Recapping: warp scheduling on Fermi
• Threads are assigned to an SM in units of a thread block; multiple blocks per SM
• Each block is divided into warps of 32 (SIMD) threads, the schedulable unit
  - A warp becomes eligible for execution when all its operands are available
  - Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy
  - All threads in a warp execute the same instruction; branches serialize execution
• Multiple warps are simultaneously active, hiding data transfer delays
• All registers in all the warps are available: zero-overhead scheduling
• Hardware is free to assign blocks to any SM

[Figure: SM block diagram (MT issue unit, shared memory/L1) and a warp-interleaved timeline: warp 8 instr. 11, warp 1 instr. 42, warp 3 instr. 95, ..., warp 8 instr. 12, warp 3 instr. 96]
Instruction issue on Fermi
• Each vector unit has:
  - 32 CUDA cores for integer and floating-point arithmetic
  - 4 special function units for single-precision transcendentals
• 2 warp schedulers; each scheduler issues 1 (2) instructions for capability 2.0 (2.1)
• One scheduler is in charge of odd warp IDs, the other of even warp IDs
• Only 1 scheduler at a time can issue a double-precision floating-point instruction
• A warp scheduler can issue an instruction to ½ the CUDA cores in an SM
• The scheduler must issue an integer or floating-point arithmetic instruction over 2 clock cycles
Dynamic behavior – resource utilization
• Each vector core (SM) has 1024 thread slots and 8 block slots
• Hardware partitions slots into blocks at run time, accommodating different processing capacities
• Registers are split dynamically across all blocks assigned to the vector core
• A register is private to a single thread within a single block
Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC
Thread divergence
• All the threads in a warp execute the same instruction
• Different control paths are serialized
• Divergence occurs when a predicate is a function of the thread ID:
  if (threadIdx.x < 2) { }
• No divergence if all threads follow the same path:
  if (threadIdx.x / WARP_SIZE < 2) { }
• Consider reduction, e.g. summation ∑i xi
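The serialization cost can be made concrete with a toy SIMT model (this is my sketch, not NVIDIA's hardware algorithm; `passes_per_warp` is a hypothetical helper): a warp needs one serialized pass per distinct branch outcome among its 32 threads.

```python
# Toy model: count how many serialized passes each warp needs for a branch.
WARP_SIZE = 32

def passes_per_warp(predicate, n_threads):
    """One pass per distinct branch outcome within each warp."""
    passes = []
    for base in range(0, n_threads, WARP_SIZE):
        outcomes = {predicate(tid) for tid in range(base, base + WARP_SIZE)}
        passes.append(len(outcomes))
    return passes

# if (threadIdx.x < 2): threads 0-1 split from 2-31, so warp 0 diverges
assert passes_per_warp(lambda t: t < 2, 64) == [2, 1]
# if (threadIdx.x / WARP_SIZE < 2): uniform within every warp, no divergence
assert passes_per_warp(lambda t: t // WARP_SIZE < 2, 64) == [1, 1]
```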
Thread divergence
• All the threads in a warp execute the same instruction
• Different control paths are serialized

[Figure: a warp is a group of scalar threads; at a branch, the warp executes Path A for the taken threads and then Path B for the rest before reconverging; thread warps 3, 7, and 8 interleave. Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt, UBC]
Divergence example

if (threadIdx.x >= 2) a = 100; else a = -100;

[Figure: SIMD lanes P0, P1, ..., PM-1, each with its own registers, all compare threadIdx.x against 2. Mary Hall]
Divergence example (continued)

[Figure, one pass: lanes with threadIdx.x ≥ 2 execute a = 100; lanes 0 and 1 are masked off (X X ✔ ✔). Mary Hall]
Divergence example (continued)

[Figure, other pass: lanes 0 and 1 execute a = -100; the remaining lanes are masked off (✔ ✔ X X). Mary Hall]
Thread divergence (recap)
• All the threads in a warp execute the same instruction
• Different control paths are serialized
• Divergence occurs when a predicate is a function of the thread ID:
  if (threadIdx.x < 2) { }
• No divergence if all threads within a warp follow the same path:
  if (threadIdx.x / WARP_SIZE < 2) { }
• We can have different control paths within the thread block
Example – reduction with thread divergence

[Figure: tree summation of 3 1 7 0 4 1 6 3; step 1: threads 0, 2, 4, 6, 8, 10 add adjacent pairs (4 7 5 9 11 14); step 2: threads 0, 4, 8 combine partials; step 3 produces the block sum (25). Active threads are interleaved, so every warp diverges. David Kirk/NVIDIA & Wen-mei Hwu/UIUC]
The naïve code

```cuda
__global__ void reduce(int *input, unsigned int N, int *total) {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int x[BSIZE];
    x[tid] = (i < N) ? input[i] : 0;
    __syncthreads();
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (tid % (2 * stride) == 0)
            x[tid] += x[tid + stride];
    }
    if (tid == 0)
        atomicAdd(total, x[tid]);
}
```
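A hedged Python model of the kernel's stride loop (one block of 512 threads; `naive_reduce` and `divergent_warps` are my illustrative helpers) confirms the result and counts the warps whose threads disagree on the `tid % (2*stride) == 0` branch:

```python
# Model of the naive interleaved-pair reduction for one 512-thread block.
WARP_SIZE = 32
BLOCK = 512

def naive_reduce(x):
    """Mirror the kernel's stride loop sequentially."""
    x = list(x)
    stride = 1
    while stride < len(x):
        for tid in range(len(x)):
            if tid % (2 * stride) == 0:
                x[tid] += x[tid + stride]
        stride *= 2
    return x[0]

def divergent_warps(stride):
    """Warps in which the branch predicate is non-uniform at this stride."""
    count = 0
    for base in range(0, BLOCK, WARP_SIZE):
        outcomes = {tid % (2 * stride) == 0
                    for tid in range(base, base + WARP_SIZE)}
        if len(outcomes) == 2:
            count += 1
    return count

assert naive_reduce(list(range(512))) == sum(range(512))
assert divergent_warps(1) == 16     # every warp diverges at the first step
assert divergent_warps(256) == 1    # even one active thread splits its warp
```

Active threads stay interleaved, so some warp diverges at every stride of the loop.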
Reducing divergence and avoiding bank conflicts

[Figure: instead of interleaved pairs, thread i adds element i+16 into element i (thread 0 computes 0+16, ..., thread 15 computes 15+31), so the active threads stay contiguous. David Kirk/NVIDIA & Wen-mei Hwu/UIUC]
The improved code

Before:
```cuda
for (stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (tid % (2 * stride) == 0)
        x[tid] += x[tid + stride];
}
```

After:
```cuda
__shared__ int x[BSIZE];
unsigned int tid = threadIdx.x;
unsigned int s;
for (s = blockDim.x / 2; s > 0; s /= 2) {
    __syncthreads();
    if (tid < s)
        x[tid] += x[tid + s];
}
```

• All threads in a warp execute the same instruction: reduceSum <<<N/512, 512>>> (x, N)
• No divergence until stride < 32
• All warps active when stride ≥ 32
• Active threads by step: s = 256: threads 0:255; s = 128: threads 0:127; s = 64: threads 0:63; s = 32: threads 0:31; s = 16: threads 0:15; …
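The same Python-style model applied to the improved loop (again my sketch; `improved_reduce` and `divergent_warps` are illustrative helpers) shows why contiguous active threads help: at each stride a warp is either fully active or fully idle, and divergence appears only once s drops below the warp size.

```python
# Model of the improved contiguous-thread reduction, one 512-thread block.
WARP_SIZE = 32
BLOCK = 512

def improved_reduce(x):
    x = list(x)
    s = len(x) // 2
    while s > 0:
        for tid in range(s):
            x[tid] += x[tid + s]
        s //= 2
    return x[0]

def divergent_warps(s):
    """Warps in which the `tid < s` predicate is non-uniform."""
    count = 0
    for base in range(0, BLOCK, WARP_SIZE):
        outcomes = {tid < s for tid in range(base, base + WARP_SIZE)}
        if len(outcomes) == 2:
            count += 1
    return count

assert improved_reduce(list(range(512))) == sum(range(512))
assert all(divergent_warps(s) == 0 for s in (256, 128, 64, 32))
assert divergent_warps(16) == 1   # only the first warp, only when s < 32
```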
Predication on Fermi
• All instructions support predication in compute capability 2.x
• Each thread has a condition code (predicate) set to true or false
• An instruction executes only if its predicate is true:
  if (x > 1) y = 7;   becomes   test = (x > 1);  test: y = 7;
• The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is not too large
• If the compiler predicts too many divergent warps, the threshold is 7 instructions; otherwise it is 4
Concurrency – Host & Device
• Nonblocking operations
  - Asynchronous device ↔ {host, device} transfers
  - Kernel launch
• Interleaved CPU and device calls: kernel calls run asynchronously with respect to the host
• But a sequence of kernel calls runs sequentially on the device
• Multiple kernel invocations in separate CUDA streams run as interleaved, independent computations:

```cuda
cudaStream_t st1, st2, st3, st4;
cudaStreamCreate(&st1); ...
cudaMemcpyAsync(d1, h1, N, H2D, st1);
kernel2<<<Grid, Block, 0, st2>>>( ..., d2, ... );
kernel3<<<Grid, Block, 0, st3>>>( ..., d3, ... );
cudaMemcpyAsync(h4, d4, N, D2H, st4);
CPU_function(...);
```

on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Today's lecture
• Bank conflicts (previous lecture)
• Further improvements to matrix multiplication
• Thread divergence
• Scheduling
Dual warp scheduling on Fermi
• 2 schedulers per SM each find an eligible warp
  - Each issues 1 instruction per cycle
  - Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy
  - Issue selection: round-robin / age of warp
  - One scheduler handles odd warp IDs, the other even warp IDs
  - A warp scheduler can issue an instruction to ½ the cores; each scheduler issues 1 (2) instructions for capability 2.0 (2.1)
  - 16 load/store units
  - The scheduler must issue an integer or floating-point arithmetic instruction over 2 clock cycles
  - Most instructions can be dual issued: 2 integer, 2 FP, or a mix of int/FP/load-store
  - Only 1 double-precision instruction at a time
• All registers in all the warps are available: zero-overhead scheduling
• Overhead may be different when switching blocks
What makes a processor run faster?
• Registers and cache
• Pipelining
• Instruction level parallelism
• Vectorization, SSE
Pipelining
• Assembly-line processing, as in an auto plant
• Dave Patterson's laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min per load

[Figure: Gantt chart of loads A, B, C, D from 6 PM onward; stage intervals 30, 40, 40, 40, 40, 20 minutes]

• Sequential execution takes 4 × 90 min = 6 hours
• Pipelined execution takes 30 + 4×40 + 20 = 210 min = 3.5 hours
• Bandwidth = loads/hour
• Pipelining helps bandwidth but not latency (still 90 min per load)
• Bandwidth is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
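The laundry arithmetic fits a one-line timing model: one load's full latency, plus one bottleneck-stage interval for each additional load (this reproduces the slide's 30 + 4×40 + 20 formula for these stage times).

```python
# Timing model for Patterson's laundry pipeline (stage times in minutes).
WASH, DRY, FOLD = 30, 40, 20
LOADS = 4

sequential = LOADS * (WASH + DRY + FOLD)
# One load's latency, plus one bottleneck-stage interval per extra load:
pipelined = (WASH + DRY + FOLD) + (LOADS - 1) * max(WASH, DRY, FOLD)

assert sequential == 360          # 6 hours
assert pipelined == 210           # 30 + 4*40 + 20 minutes = 3.5 hours
assert WASH + DRY + FOLD == 90    # per-load latency is unchanged
```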
Instruction level parallelism
• Execute more than one instruction at a time:
  x = y / z;  a = b + c;
• Dynamic techniques
  - Allow stalled instructions to proceed
  - Reorder instructions

Before reordering:          After reordering:
  x = y / z;                  x = y / z;
  a = x + c;                  t = c - q;
  t = c - q;                  a = x + c;
Multi-execution pipeline
• MIPS R4000

[Figure: after IF and ID, an instruction enters one of four units: the integer unit (single EX stage), the FP/int multiplier (stages m1-m7), the FP adder (stages a1-a4), or the FP/int divider (latency = 25); then MEM and WB. David Culler]

  x = y / z;  a = b + c;  s = q * r;  i++;
Scoreboarding, Tomasulo's algorithm
• Hardware data structures keep track of:
  - When instructions complete
  - Which instructions depend on the results
  - When it's safe to write a register
• Deals with data hazards:
  - WAR (write after read)
  - RAW (read after write)
  - WAW (write after write)

[Figure: instruction fetch, then issue & resolve; the scoreboard sits between operand fetch and the functional units (FUs)]

Example sequence: a = b*c;  x = y - b;  q = a / y;  y = x + b
Scoreboarding
• Keeps track of all register operands of all instructions in the instruction buffer
  - An instruction becomes ready after the needed values are written
  - Eliminates hazards
  - Ready instructions are eligible for issue
• Decouples the memory and processor pipelines
  - Threads can issue instructions until the scoreboard prevents issue
  - Allows memory/processor ops to proceed in parallel with other waiting memory/processor ops

[Figure: warp interleaving over time (TB = thread block, W = warp): TB1 W1 issues instructions 1-6 and stalls; TB2 W1 issues 1-2 and stalls; TB3 W1 issues 1-2; TB3 W2 issues 1-2 and stalls; TB1 W1 resumes at instruction 7; ...]
Static scheduling limits performance
• The ADDD instruction is stalled on the DIVide...
• ...stalling further instruction issue, e.g. the SUBD

  DIV  F0,  F2,  F4
  ADDD F10, F0,  F8
  SUBD F12, F8,  F14

• But SUBD doesn't depend on ADDD or DIV
• If we have two adder/subtraction units, one will sit idle uselessly until the DIV finishes
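A toy issue-timing model makes the stall visible (the latencies below are illustrative, not real DLX numbers, and `finish_times` is my hypothetical helper): with in-order issue, SUBD is trapped behind the stalled ADDD; with out-of-order issue it starts immediately.

```python
# Toy model: in-order vs. out-of-order issue for the slide's sequence.
LAT = {"DIV": 40, "ADDD": 2, "SUBD": 2}          # illustrative latencies
PROG = [("DIV",  "F0",  ("F2", "F4")),
        ("ADDD", "F10", ("F0", "F8")),           # RAW on F0: waits for DIV
        ("SUBD", "F12", ("F8", "F14"))]          # independent of both

def finish_times(in_order):
    avail = {}            # register -> cycle its value is written
    done = {}
    prev_start = 0
    for op, dst, srcs in PROG:
        start = max([avail.get(r, 0) for r in srcs] + [0])
        if in_order:      # cannot issue past an earlier, stalled instruction
            start = max(start, prev_start)
        prev_start = start
        done[op] = start + LAT[op]
        avail[dst] = done[op]
    return done

assert finish_times(in_order=True)["SUBD"] == 42   # stalled behind ADDD
assert finish_times(in_order=False)["SUBD"] == 2   # runs while DIV executes
```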
Dynamic scheduling
• Idea: modify the pipeline to permit instructions to execute as soon as their operands become available
• This is known as out-of-order execution (classic dataflow)
• The SUBD can now proceed normally
• Complication: dynamically scheduled instructions also complete out of order
Dynamic scheduling splits the ID stage
• Issue sub-stage
  - Decode the instructions
  - Check for structural hazards
• Read-operands sub-stage
  - Wait until there are no data hazards
  - Read operands
• We need additional registers to store pending instructions that aren't ready to execute

[Figure: GPU front end: the I-cache feeds a multithreaded instruction buffer; operand select draws on the register file, constant cache, L1, and shared memory before dispatch to the MAD and SFU units. Mary Hall]
Consequences of a split ID stage
• We distinguish between the time when an instruction begins execution and when it completes
• Previously, an instruction stalled in the ID stage, and this held up the entire pipeline
• Instructions can now be in a suspended state, neither stalling the pipeline nor executing
• They are simply waiting on operands
Two schemes for dynamic scheduling
• Scoreboard
  - CDC 6600
• Tomasulo's algorithm
  - IBM 360/91
• We'll vary the number of functional units, their latency, and functional unit pipelining
What is scoreboarding?
• A technique that allows instructions to execute out of order...
  - so long as there are sufficient resources, and
  - no data dependencies
• The goal of scoreboarding: maintain an execution rate of one instruction per clock cycle
Multiple execution pipelines in DLX with scoreboarding

  Unit         Latency (cycles)
  Integer      0
  Memory       1
  FP Add       2
  FP Multiply  10
  FP Div       40

[Figure: the scoreboard exchanges control/status signals with five functional units: an integer unit, an FP adder, an FP divider, and two FP multipliers; the units connect to the registers over data buses]
Scoreboard controls the pipeline

[Figure: instruction fetch → instruction decode → pre-issue buffer → instruction issue → pre-execution buffers (read operands) → execution units 1, 2, ... → post-execution buffers → write results; the scoreboard / control unit checks WAW hazards at issue, RAW hazards at operand read, and WAR hazards at write-back. Mike Frank]
What are the requirements?
• Responsibility for instruction issue and execution, including hazard detection
• Multiple instructions must be in the EX stage simultaneously
• Either through pipelining or multiple functional units
• DLX has 2 multipliers, 1 divider, and 1 integer unit (memory, branch, integer arithmetic)
How does it work?
• As each instruction passes through the scoreboard, the scoreboard constructs a description of its data dependencies (issue)
• The scoreboard determines when the instruction can read its operands and begin execution
• If the instruction can't begin execution, the scoreboard keeps a record and watches for the moment when the instruction can execute
• It also controls when an instruction may write its result
• All hazard detection is centralized
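The bookkeeping (only the bookkeeping: no CDC 6600 timing or bus detail) can be sketched as a table of reads and writes per pending instruction; an instruction may start once no earlier, still-pending instruction writes one of its sources. For simplicity this sketch handles only RAW dependences and ignores WAR/WAW, and `schedule` is my hypothetical helper.

```python
# Minimal scoreboard-style bookkeeping: RAW dependences only.
LAT = {"MUL": 10, "SUB": 2, "DIV": 40, "ADD": 2}   # illustrative latencies

def schedule(prog):
    """prog: list of (opcode, dst, srcs). Returns finish cycle per index."""
    avail = {}                          # register -> cycle value is written
    finish = {}
    remaining = set(range(len(prog)))
    while remaining:
        for i in sorted(remaining):
            op, dst, srcs = prog[i]
            # blocked if an earlier pending instruction will write a source
            if any(j in remaining and j < i and prog[j][1] == r
                   for r in srcs for j in range(len(prog))):
                continue
            start = max([avail.get(r, 0) for r in srcs] + [0])
            finish[i] = start + LAT[op]
            avail[dst] = finish[i]
            remaining.discard(i)
    return finish

# The earlier slide's sequence: a = b*c; x = y-b; q = a/y; y = x+b
prog = [("MUL", "a", ("b", "c")),
        ("SUB", "x", ("y", "b")),
        ("DIV", "q", ("a", "y")),
        ("ADD", "y", ("x", "b"))]
fin = schedule(prog)
assert fin[1] < fin[0]      # the subtract finishes before the multiply
assert fin[2] == 10 + 40    # the divide must wait for a = b*c
assert fin[3] == 2 + 2      # y = x+b needn't wait for the divide
```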
How is a scoreboard implemented?
• As a centralized bookkeeping table
• It tracks instructions, the register operand(s) they depend on, and which register they modify
• The status of result registers (who is going to write to a given register)
• The status of the functional units