spring 2014jim hogg - uw - cse - p501o-1 cse p501 – compiler construction instruction scheduling...
TRANSCRIPT
Spring 2014 Jim Hogg - UW - CSE - P501 O-1
CSE P501 – Compiler Construction
Instruction Scheduling Issues
Latencies
List scheduling
Instruction Scheduling is . . .
Spring 2014 Jim Hogg - UW - CSE - P501 O-2
Schedule
• Execute in-order to get correct answer
abcdefgh
badfcghf
• Issue in new order• eg: memory fetch is
slow• eg: divide is slow• Overall faster• Still get correct
answer!
Originally devised for super-computersNow used everywhere:
• in-order procs - older ARM• out-of-order procs - newer x86
Compiler does 'heavy lifting' - reduce chip power
Spring 2014 JIm Hogg - UW - CSE - P501 O-3
Chip Complexity, 1
Following factors make scheduling complicated:
Different kinds of instruction take different times (in clock cycles) to complete
Modern chips have multiple functional units so they can issue several operations per cycle "super-scalar"
Loads are non-blocking ~50 in-flight loads and ~50 in-flight stores
Typical Instruction Timings
Spring 2014 JIm Hogg - UW - CSE - P501 O-4
Instruction Time in Clock Cycles
int int 1
int * int 3
float float 3
float * float 5
float float 15
int int 30
Load Latencies
T-5
Core Core
L1 = 64 KB per core
L2 = 256 KB per core
L3 = 2-8 MB shared
DRAM
Instruction ~5 per cycle
Register 1 cycle L1 Cache ~4
cycles L2 Cache ~10
cycles L3 Cache ~40
cycles DRAM ~100 ns
Spring 2014 JIm Hogg - UW - CSE - P501 O-6
Super-Scalar
Spring 2014 JIm Hogg - UW - CSE - P501 O-7
Chip Complexity, 2
Branch costs vary (branch predictor)
Branches on some processors have delay slots (eg: Sparc)
Modern processors have branch-predictor logic in hardware
heuristics predict whether branches are taken or not keeps pipelines full
GOAL: Scheduler should reorder instructions to hide latencies take advantage of multiple function units (and delay
slots) help the processor effectively pipeline execution
However, many chips schedule on-the-fly too eg: Haswell out-of-order window = 192 ops
Data Dependence Graph
Spring 2014 JIm Hogg - UW - CSE - P501 O-8
a
i
c
g
h
d
f
b
e
Start Cycle
Instruction
a loadAI rarp, @a => r1
b add r1, r1 => r1
c loadAI rarp, @b => r2
d mult r1, r2 => r1
e loadAI rarp, @c => r2
f mult r1, r2 => r1
g loadAI rarp, @d => r2
h mult r1, r2 => r1
i storeAI r1 => rarp, @a
read-after-write = RAW = true dependence = flow dependencewrite-after-read = WAR = anti-dependencewrite-after-write = WAW = output-dependence
The scheduler has freedom to re-order instructions, so long as it complies with inter-instruction dependencies
leaf
root
Scheduling Really Works ...
Spring 2014 JIm Hogg - UW - CSE - P501 O-9
Start Cycle
Instruction
1 loadAI rarp, @a => r1
4 add r1, r1 => r1
5 loadAI rarp, @b => r2
8 mult r1, r2 => r1
10 loadAI rarp, @c => r2
13 mult r1, r2 => r1
15 loadAI rarp, @d => r2
18 mult r1, r2 => r1
20 storeAI r1 => rarp, @a
Start Cycle
Instruction
1 loadAI rarp, @a => r1
2 loadAI rarp, @b => r2
3 loadAI rarp, @c => r3
4 add r1, r1 => r1
5 mult r1, r2 => r1
6 loadAI rarp, @d => r2
7 mult r1, r3 => r1
9 mult r1, r2 => r1
11 storeAI r1 => rarp, @a
Original Scheduled
1 Functional UnitLoad or Store: 3 cyclesMultiply: 2 cyclesOtherwise: 1 cycle
a = 2*a*b*c*d New schedule uses extra register: r3
Preserves (WAW) output-dependency
Spring 2014 JIm Hogg - UW - CSE - P501 O-10
Scheduler: Job Description
The Job Given code for some machine; and latencies for each
instruction, reorder to minimize execution time
Constraints Produce correct code Minimize wasted cycles Avoid spilling registers Don't take forever to reach an answer
Job Description - Part 2
foreach instruction in dependence graph
Denote current instruction as ins
Denote number of cyles to execute as ins.delay
Denote cycle number in which ins should start as ins.start
foreach instruction dep that is dependent on ins
Ensure ins.start + ins.delay <= dep.start
Spring 2014 JIm Hogg - UW - CSE - P501 O-11
What if the scheduler makes a mistake?
On-chip hardware stalls the pipeline until operands become available: so slower, but still correct!
Dependence Graph + Timings
Spring 2014 JIm Hogg - UW - CSE - P501 O-12
a13
i3
c1
2
g8
h5
d9
f7
b10
e1
0
Start Cycle
Instruction
a loadAI rarp, @a => r1
b add r1, r1 => r1
c loadAI rarp, @b => r2
d mult r1, r2 => r1
e loadAI rarp, @c => r2
f mult r1, r2 => r1
g loadAI rarp, @d => r2
h mult r1, r2 => r1
i storeAI r1 => rarp, @a
1 Functional UnitLoad or Store: 3 cyclesMultiply: 2 cyclesOtherwise: 1 cycle
• Superscripts show path length to end of computation• a-b-d-f-h-i is critical path• Can schedule leaves any time - no constraints• Since a has longest delay, schedule it first; then c;
then ...
Spring 2014 JIm Hogg - UW - CSE - P501 O-13
List Scheduling
Build a precedence graph D
Compute a priority function over the nodes in D
typical: longest latency-weighted path
Rename registers to remove WAW conflicts
Create schedule, one cycle at a time Use queue of operations that are Ready At each cycle
Choose a Ready operation and schedule it Update Ready queue
O-14
List Scheduling Algorithm
cycle = 1 // clock cycle numberReady = leaves of D // ready to be
scheduledActive = { } // being executed
while Ready Active {} do foreach ins Active do if ins.start + ins.delay < cycle then remove ins from Active foreach successor suc of ins in D do if suc Ready then Ready = {suc} endif enddo endif endforeach if Ready {} then remove an instruction, ins, from Ready ins.start = cycle; Active = ins; endif cycle++endwhile
Beyond Basic Blocks
List scheduling dominates, but moving beyond basic blocks can improve quality of the code. Possibilities:
Schedule extended basic blocks (EBBs) Watch for exit points – limits reordering or requires
compensating
Trace scheduling Use profiling information to select regions for scheduling
using traces (paths) through code
Spring 2014 JIm Hogg - UW - CSE - P501 O-15