
Spring 2014 Jim Hogg - UW - CSE - P501 O-1

CSE P501 – Compiler Construction

Instruction Scheduling Issues

Latencies

List scheduling

Instruction Scheduling is . . .

Schedule

• Execute in-order to get correct answer: abcdefgh

• Issue in new order: badfcghf
  • eg: memory fetch is slow
  • eg: divide is slow
  • Overall faster
  • Still get correct answer!

Originally devised for super-computers. Now used everywhere:

• in-order procs - older ARM
• out-of-order procs - newer x86

Compiler does 'heavy lifting' - reduce chip power


Chip Complexity, 1

Following factors make scheduling complicated:

Different kinds of instruction take different times (in clock cycles) to complete

Modern chips have multiple functional units, so they can issue several operations per cycle ("super-scalar")

Loads are non-blocking: ~50 in-flight loads and ~50 in-flight stores

Typical Instruction Timings


Instruction      Time in Clock Cycles
int + int        1
int * int        3
float + float    3
float * float    5
float / float    15
int / int        30

Load Latencies

[Figure: memory hierarchy, two cores above the cache levels and DRAM]

L1 = 64 KB per core
L2 = 256 KB per core
L3 = 2-8 MB shared
DRAM

Instruction      ~5 per cycle
Register         1 cycle
L1 Cache         ~4 cycles
L2 Cache         ~10 cycles
L3 Cache         ~40 cycles
DRAM             ~100 ns


Super-Scalar


Chip Complexity, 2

Branch costs vary (branch predictor)

• Branches on some processors have delay slots (eg: Sparc)
• Modern processors have branch-predictor logic in hardware
  • heuristics predict whether branches are taken or not
  • keeps pipelines full

GOAL: Scheduler should reorder instructions to

• hide latencies
• take advantage of multiple function units (and delay slots)
• help the processor effectively pipeline execution

However, many chips schedule on-the-fly too

• eg: Haswell out-of-order window = 192 ops

Data Dependence Graph


[Figure: data dependence graph over nodes a through i, with the leaf and root nodes labelled]

Label  Instruction
a      loadAI   rarp, @a => r1
b      add      r1, r1 => r1
c      loadAI   rarp, @b => r2
d      mult     r1, r2 => r1
e      loadAI   rarp, @c => r2
f      mult     r1, r2 => r1
g      loadAI   rarp, @d => r2
h      mult     r1, r2 => r1
i      storeAI  r1 => rarp, @a

• read-after-write = RAW = true dependence = flow dependence
• write-after-read = WAR = anti-dependence
• write-after-write = WAW = output-dependence

The scheduler has freedom to re-order instructions, so long as it complies with inter-instruction dependencies
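As a rough illustration of the three dependence kinds, the sketch below walks the nine-instruction example in program order and records the register dependences it finds. The tuple encoding of each instruction and the helper name `dependences` are assumptions made for this example, not part of the course material. Dependences through memory (the load and store of @a) are ignored here.

```python
# Each entry: (label, registers read, registers written).
# This encoding is a hypothetical one chosen for the sketch.
CODE = [
    ("a", ("rarp",), ("r1",)),       # loadAI  rarp, @a => r1
    ("b", ("r1",), ("r1",)),         # add     r1, r1   => r1
    ("c", ("rarp",), ("r2",)),       # loadAI  rarp, @b => r2
    ("d", ("r1", "r2"), ("r1",)),    # mult    r1, r2   => r1
    ("e", ("rarp",), ("r2",)),       # loadAI  rarp, @c => r2
    ("f", ("r1", "r2"), ("r1",)),    # mult    r1, r2   => r1
    ("g", ("rarp",), ("r2",)),       # loadAI  rarp, @d => r2
    ("h", ("r1", "r2"), ("r1",)),    # mult    r1, r2   => r1
    ("i", ("r1", "rarp"), ()),       # storeAI r1 => rarp, @a
]


def dependences(code):
    """Return (kind, earlier, later) triples for register dependences."""
    last_writer = {}  # register -> label of its most recent writer
    readers = {}      # register -> labels that read it since that write
    deps = []
    for label, reads, writes in code:
        for reg in reads:
            if reg in last_writer:
                deps.append(("RAW", last_writer[reg], label))
        for reg in writes:
            for reader in readers.get(reg, []):
                deps.append(("WAR", reader, label))
            if reg in last_writer:
                deps.append(("WAW", last_writer[reg], label))
        # Record this instruction's reads, then let its writes start a new value.
        for reg in reads:
            readers.setdefault(reg, []).append(label)
        for reg in writes:
            last_writer[reg] = label
            readers[reg] = []
    return deps


for kind, earlier, later in dependences(CODE):
    print(kind, earlier, "->", later)
```

The RAW triples form the edges of the graph (a-b, b-d, c-d, d-f, e-f, f-h, g-h, h-i). The WAR and WAW triples all come from reusing r1 and r2, which is why renaming registers gives the scheduler more freedom.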

Scheduling Really Works ...


Original

Start Cycle  Instruction
1            loadAI   rarp, @a => r1
4            add      r1, r1 => r1
5            loadAI   rarp, @b => r2
8            mult     r1, r2 => r1
10           loadAI   rarp, @c => r2
13           mult     r1, r2 => r1
15           loadAI   rarp, @d => r2
18           mult     r1, r2 => r1
20           storeAI  r1 => rarp, @a

Scheduled

Start Cycle  Instruction
1            loadAI   rarp, @a => r1
2            loadAI   rarp, @b => r2
3            loadAI   rarp, @c => r3
4            add      r1, r1 => r1
5            mult     r1, r2 => r1
6            loadAI   rarp, @d => r2
7            mult     r1, r3 => r1
9            mult     r1, r2 => r1
11           storeAI  r1 => rarp, @a

Machine model: 1 Functional Unit
• Load or Store: 3 cycles
• Multiply: 2 cycles
• Otherwise: 1 cycle

a = 2*a*b*c*d

New schedule uses extra register: r3

Preserves (WAW) output-dependency


Scheduler: Job Description

The Job

• Given code for some machine, and latencies for each instruction, reorder to minimize execution time

Constraints

• Produce correct code
• Minimize wasted cycles
• Avoid spilling registers
• Don't take forever to reach an answer

Job Description - Part 2

foreach instruction in dependence graph
    Denote current instruction as ins
    Denote number of cycles to execute as ins.delay
    Denote cycle number in which ins should start as ins.start
    foreach instruction dep that is dependent on ins
        Ensure ins.start + ins.delay <= dep.start
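The constraint can be checked mechanically. The sketch below is one possible checker, run against the two schedules from the "Scheduling Really Works" slide. The dictionary layout and the names `violations` and `last_busy_cycle` are assumptions for this example. Only true (RAW) dependences are checked, on the assumption that anti- and output-dependences have been handled by register choice.

```python
# Latencies from the slide's machine model, keyed by instruction label
# (loads and stores 3 cycles, multiplies 2, everything else 1).
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# True-dependence edges (ins, dep): dep consumes the result of ins.
EDGES = [("a", "b"), ("b", "d"), ("c", "d"), ("d", "f"),
         ("e", "f"), ("f", "h"), ("g", "h"), ("h", "i")]

# Start cycles copied from the two tables on the earlier slide.
ORIGINAL = {"a": 1, "b": 4, "c": 5, "d": 8, "e": 10, "f": 13, "g": 15, "h": 18, "i": 20}
SCHEDULED = {"a": 1, "c": 2, "e": 3, "b": 4, "d": 5, "g": 6, "f": 7, "h": 9, "i": 11}


def violations(start):
    """Edges for which ins.start + ins.delay <= dep.start does not hold."""
    return [(ins, dep) for ins, dep in EDGES
            if start[ins] + DELAY[ins] > start[dep]]


def last_busy_cycle(start):
    """Last cycle in which any instruction is still executing."""
    return max(cycle + DELAY[ins] - 1 for ins, cycle in start.items())


print(violations(ORIGINAL), last_busy_cycle(ORIGINAL))    # [] 22
print(violations(SCHEDULED), last_busy_cycle(SCHEDULED))  # [] 13

# Starting b one cycle early breaks the a -> b edge.
too_eager = dict(SCHEDULED, b=3)
print(violations(too_eager))  # [('a', 'b')]
```

A schedule that fails this check would still run correctly on hardware that stalls for operands, only more slowly than the scheduler intended.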


What if the scheduler makes a mistake?

On-chip hardware stalls the pipeline until operands become available: so slower, but still correct!

Dependence Graph + Timings


[Figure: the dependence graph, with each node annotated by its path length to the end of the computation]

Label  Path length  Instruction
a      13           loadAI   rarp, @a => r1
b      10           add      r1, r1 => r1
c      12           loadAI   rarp, @b => r2
d      9            mult     r1, r2 => r1
e      10           loadAI   rarp, @c => r2
f      7            mult     r1, r2 => r1
g      8            loadAI   rarp, @d => r2
h      5            mult     r1, r2 => r1
i      3            storeAI  r1 => rarp, @a

Machine model: 1 Functional Unit
• Load or Store: 3 cycles
• Multiply: 2 cycles
• Otherwise: 1 cycle

• Path lengths (superscripts in the figure) show latency-weighted distance to end of computation
• a-b-d-f-h-i is critical path
• Can schedule leaves any time - no constraints
• Since a has the longest path length, schedule it first; then c; then ...
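The path lengths can be reproduced with a short recursive computation: a node's value is its own latency plus the largest value among its successors. The sketch below assumes the true-dependence edges of the example and the slide's latencies. The names `SUCCS`, `path_length` and `critical_path` are made up for this illustration.

```python
from functools import lru_cache

# Latencies per instruction label, from the slide's machine model.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Successors in the dependence graph (who consumes each result).
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}


@lru_cache(maxsize=None)
def path_length(ins):
    """Latency-weighted length of the longest path from ins to the end."""
    return DELAY[ins] + max((path_length(s) for s in SUCCS[ins]), default=0)


def critical_path():
    """Follow the longest remaining path, starting from the costliest node."""
    node = max(SUCCS, key=path_length)
    path = [node]
    while SUCCS[node]:
        node = max(SUCCS[node], key=path_length)
        path.append(node)
    return path


print({n: path_length(n) for n in SUCCS})
print("-".join(critical_path()))  # a-b-d-f-h-i
```

Sorting the leaves by this value gives a (13), c (12), e (10), g (8), which is the order in which the slide suggests issuing the loads.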


List Scheduling

• Build a precedence graph D

• Compute a priority function over the nodes in D
  • typical: longest latency-weighted path

• Rename registers to remove WAW conflicts

• Create schedule, one cycle at a time
  • Use queue of operations that are Ready
  • At each cycle
    • Choose a Ready operation and schedule it
    • Update Ready queue
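The renaming step can be sketched as follows: every definition gets a fresh virtual register and later uses are rewritten to match. This is a simplified illustration for straight-line code only. The `(opcode, sources, destination)` encoding, the `v1, v2, ...` naming and the function name `rename` are assumptions for the example.

```python
import itertools

# (opcode, sources, destination); storeAI writes memory, so it has no
# destination register. This encoding is hypothetical.
CODE = [
    ("loadAI", ("rarp", "@a"), "r1"),
    ("add", ("r1", "r1"), "r1"),
    ("loadAI", ("rarp", "@b"), "r2"),
    ("mult", ("r1", "r2"), "r1"),
    ("loadAI", ("rarp", "@c"), "r2"),
    ("mult", ("r1", "r2"), "r1"),
    ("loadAI", ("rarp", "@d"), "r2"),
    ("mult", ("r1", "r2"), "r1"),
    ("storeAI", ("r1", "rarp", "@a"), None),
]


def rename(code):
    """Give each definition a fresh name and rewrite the uses that follow."""
    fresh = (f"v{n}" for n in itertools.count(1))
    current = {}  # original register -> its current virtual name
    renamed = []
    for op, srcs, dst in code:
        # Sources are looked up before the destination is renamed, so an
        # instruction such as add r1, r1 => r1 reads the old value.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dst = None
        if dst is not None:
            new_dst = next(fresh)
            current[dst] = new_dst
        renamed.append((op, new_srcs, new_dst))
    return renamed


for op, srcs, dst in rename(CODE):
    print(op, ", ".join(srcs), "=>", dst)
```

After renaming, no register is written twice, so only true dependences remain. The cost is register pressure: this version uses eight names where the original used two, which pulls against the "avoid spilling registers" constraint. The scheduled code shown earlier gets by with a single extra register, r3.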


List Scheduling Algorithm

cycle = 1             // clock cycle number
Ready = leaves of D   // ready to be scheduled
Active = { }          // being executed

while Ready ∪ Active ≠ { } do
    foreach ins ∈ Active do
        if ins.start + ins.delay < cycle then
            remove ins from Active
            foreach successor suc of ins in D do
                if suc ∉ Ready then
                    Ready = Ready ∪ {suc}
                endif
            enddo
        endif
    endforeach
    if Ready ≠ { } then
        remove an instruction, ins, from Ready
        ins.start = cycle
        Active = Active ∪ {ins}
    endif
    cycle++
endwhile
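A runnable sketch of the loop is given below, under three assumptions that go beyond the pseudocode. First, an instruction is retired once `start + delay <= cycle`, so that a dependent can start in the cycle permitted by the constraint `ins.start + ins.delay <= dep.start`. Second, a successor joins Ready only after all of its predecessors have retired. Third, the Ready operation with the longest latency-weighted path is chosen. The function name `list_schedule` and the dictionary inputs are made up for this example.

```python
def list_schedule(delay, succs, priority):
    """Single-issue list scheduler; returns {instruction: start cycle}."""
    # Count unfinished predecessors so we know when a node becomes ready.
    preds_left = {n: 0 for n in succs}
    for n in succs:
        for s in succs[n]:
            preds_left[s] += 1

    cycle = 1
    ready = {n for n, count in preds_left.items() if count == 0}  # leaves of D
    active = set()
    start = {}
    while ready or active:
        for ins in sorted(active):  # sorted() copies, so removal is safe
            if start[ins] + delay[ins] <= cycle:
                active.remove(ins)
                for suc in succs[ins]:
                    preds_left[suc] -= 1
                    if preds_left[suc] == 0:
                        ready.add(suc)
        if ready:
            # Highest priority first; ties broken by label order.
            ins = max(sorted(ready), key=lambda n: priority[n])
            ready.remove(ins)
            start[ins] = cycle
            active.add(ins)
        cycle += 1
    return start


DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}
PRIORITY = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10, "f": 7, "g": 8, "h": 5, "i": 3}

print(list_schedule(DELAY, SUCCS, PRIORITY))
```

On the running example this yields start cycles a=1, c=2, e=3, b=4, d=5, g=6, f=7, h=9, i=11, the same as the scheduled table shown earlier. The graph passed in contains only true dependences, so the result is valid only if the third load is given its own register, as the slide does with r3.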

Beyond Basic Blocks

List scheduling dominates, but moving beyond basic blocks can improve quality of the code. Possibilities:

• Schedule extended basic blocks (EBBs)
  • Watch for exit points – limits reordering or requires compensating code

• Trace scheduling
  • Use profiling information to select regions for scheduling using traces (paths) through code

