

Topic 2a Basic Back-End Optimization

Instruction Selection
Instruction Scheduling
Register Allocation

ABET Outcome

Ability to apply knowledge of basic code generation techniques (e.g., instruction scheduling, register allocation) to solve code generation problems

An ability to identify, formulate, and solve loop scheduling problems using software pipelining techniques

Ability to analyze the basic algorithms behind the above techniques and conduct experiments to show their effectiveness

Ability to use a modern compiler development platform and tools to practice the above

Knowledge of contemporary issues on this topic

Reading List

(1) K. D. Cooper & L. Torczon, Engineering a Compiler, Chapter 12

(2) Dragon Book, Chapters 10.1 ~ 10.4

A Short Tour on Data Dependence

Basic Concept and Motivation

Data dependence between 2 accesses:

The same memory location

There exists an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

Data Dependencies

There is a data dependence between statements Si and Sj if and only if: both statements access the same memory location and at least one of the statements writes into it, and there is a feasible run-time execution path from Si to Sj.

Types of Data Dependencies

Flow (true) dependence - write/read (δ):   x = 4; ... ; y = x + 1

Output dependence - write/write (δ°):   x = 4; ... ; x = y + 1

Anti-dependence - read/write (δ⁻¹):   y = x + 1; ... ; x = 4

An Example of Data Dependencies

(1) x = 4
(2) y = 6
(3) p = x + 2
(4) z = y + p
(5) x = z
(6) y = p

[Figure: the statements' data dependence graph, with flow, output, and anti dependence edges among (1)-(6).]

Data Dependence Graph (DDG)

Forms a data dependence graph between statements: nodes = statements; edges = dependence relations (labeled by type).
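For concreteness, here is a small Python sketch (my own illustration, not part of the slides) that builds such a statement-level DDG for the six-statement example above by comparing each pair's reads and writes:

# Build a statement-level DDG: nodes are statements, edges are labeled
# flow / anti / output dependences.
# (Simplified sketch: it does not drop dependences hidden by an intervening re-definition.)
stmts = [              # (variable written, variables read)
    ("x", []),         # (1) x = 4
    ("y", []),         # (2) y = 6
    ("p", ["x"]),      # (3) p = x + 2
    ("z", ["y", "p"]), # (4) z = y + p
    ("x", ["z"]),      # (5) x = z
    ("y", ["p"]),      # (6) y = p
]
edges = []
for i, (wi, ri) in enumerate(stmts):
    for j in range(i + 1, len(stmts)):
        wj, rj = stmts[j]
        if wi in rj:  edges.append((i + 1, j + 1, "flow"))    # write then read
        if wi == wj:  edges.append((i + 1, j + 1, "output"))  # write then write
        if wj in ri:  edges.append((i + 1, j + 1, "anti"))    # read then write
print(edges)   # e.g. (1, 3, 'flow'), (1, 5, 'output'), (3, 5, 'anti'), ...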

Reordering Transformations using DDG

Given a correct data dependence graph, any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program.

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code, without adding or deleting any executions of any statements.

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

Reordering Transformations (Con't)

Instruction scheduling, loop restructuring, exploiting parallelism, ...

Analyze array references to determine whether two iterations access the same memory location. Iterations I1 and I2 can be safely executed in parallel if there is no data dependence between them.

Data Dependence Graph

Example 1:

S1: A = 0
S2: B = A
S3: C = A + D
S4: D = 2

(Sx → Sy: flow dependence)

[Figure: DDG over S1-S4, e.g., flow edges S1 → S2 and S1 → S3 on A, and an anti edge S3 → S4 on D.]

Data Dependence Graph

Example 2:

S1: A = 0
S2: B = A
S3: A = B + 1
S4: C = A

[Figure: DDG with flow S1 → S2 (A), flow S2 → S3 (B), anti S2 → S3 (A), output S1 → S3 (A), and flow S3 → S4 (A).]

Should we consider input dependence?

... = X
... = X

Is the reading of the same X important?

Well, it may be (e.g., if we intend to group the 2 reads together for cache optimization).

Applications of Data Dependence Graph

- register allocation
- instruction scheduling
- loop scheduling
- vectorization
- parallelization
- memory hierarchy optimization

Data Dependence in Loops

Problem: how to extend the concept to loops?

(s1) do i = 1, 5
(s2)   x = a + 1        s2 δ⁻¹ s3 (anti on a),  s2 δ s3 (flow on x)
(s3)   a = x - 2
(s4) end do             s3 δ s2 (flow on a, next iteration)
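A tiny runnable sketch (my own illustration; the initial value of a is assumed, since the slide does not give one) of why these iterations cannot simply be reordered: each iteration's s2 consumes the a written by the previous iteration's s3.

# Loop-carried dependence: s3 of iteration i feeds s2 of iteration i+1.
a = 0                   # assumed initial value (not given on the slide)
for i in range(1, 6):   # (s1) do i = 1, 5
    x = a + 1           # (s2) reads the a written by the previous iteration's s3 (flow, distance 1)
    a = x - 2           # (s3) also anti-depends on s2's read of a within the same iteration
    print(i, x, a)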

Instruction Scheduling

Motivation

Modern processors can overlap the execution of multiple independent instructions through pipelining and multiple functional units. Instruction scheduling can improve the performance of a program by placing independent target instructions in parallel or adjacent positions.

Instruction Scheduling (con't)

• Assume all instructions are essential, i.e., we have finished optimizing the IR.
• Instruction scheduling attempts to reorder the code for maximum instruction-level parallelism (ILP).
• It is one of the instruction-level optimizations.
• Instruction scheduling (IS) in general is NP-complete, so heuristics must be used.

[Figure: Original Code → Instruction Scheduler → Reordered Code]

Instruction Scheduling: A Simple Example

a = 1 + x
b = 2 + y
c = 3 + z

Since all three instructions are independent, we can execute them in parallel, assuming adequate hardware processing resources.

[Figure: the three instructions issued in the same cycle, time on the horizontal axis.]

Hardware Parallelism

The following forms of parallelism are found in modern hardware:

• pipelining
• superscalar processing
• VLIW
• multiprocessing

Of these, the first three forms are commonly exploited by today's compilers' instruction scheduling phase.

Pipelining & Superscalar Processing

Pipelining: decompose an instruction's execution into a sequence of stages so that multiple instruction executions can be overlapped. It has the same principle as an assembly line.

Superscalar processing: multiple instructions proceed simultaneously, assisted by a hardware dynamic scheduling mechanism. This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them.

A Classic Five-Stage Pipeline

IF  RF  EX  ME  WB

IF - instruction fetch
RF - decode and register fetch
EX - execute on ALU
ME - memory access
WB - write back to register file

Pipeline Illustration

[Figure: two timing diagrams. In the standard non-pipelined model, each instruction goes through IF RF EX ME WB before the next one starts. In the pipelined model the pipeline is "full": in a given cycle each instruction is in a different stage, but every stage is active.]

Parallelism in a Pipeline

Example:

i1: add r1, r1, r2
i2: add r3, r3, r1
i3: lw  r4, 0(r1)
i4: add r5, r3, r4

Assume: register instruction = 1 cycle, memory instruction = 3 cycles.

Consider two possible instruction schedules (permutations):

Schedule S1: i1 i2 i3 i4 (2 idle cycles, completion time = 6 cycles)
Schedule S2: i1 i3 i2 i4 (1 idle cycle, completion time = 5 cycles)

Superscalar Illustration

[Figure: a two-wide superscalar pipeline; multiple instructions are in the same pipeline stage at the same time.]

Parallelism Constraints

Data-dependence constraints: if instruction A computes a value that is read by instruction B, then B can't execute before A is completed.

Resource hazards: the finite number of hardware functional units means limited parallelism.

Scheduling Complications

Hardware resources
• finite set of FUs, with instruction type, width, and latency constraints
• the cache hierarchy also has many constraints

Data dependences
• can't consume a result before it is produced
• ambiguous dependences create many challenges

Control dependences
• impractical to schedule for all possible paths
• choosing an "expected" path may be difficult
• recovery costs can be non-trivial if you are wrong

Legality Constraint for Instruction Scheduling

Question: when must we preserve the order of two instructions i and j?

Answer: when there is a dependence from i to j.

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline, as follows:

• Each DDG node is labeled with a resource-reservation table, whose value is the resource-reservation table associated with the operation type of this node.

• Each edge e from node j to node k is labeled with a weight (latency or delay) de, indicating that the destination node k must be issued no earlier than de cycles after the source node j is issued.

(Dragon book 7.2.2)
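As a concrete data-structure sketch (mine, not the Dragon book's), the weighted DDG for the four-instruction example on the next slide could be stored as:

# Weighted DDG: each node carries its operation class (a stand-in for a full
# resource-reservation table); each edge carries the delay d_e in cycles.
ddg_nodes = {
    "i1": "ALU",   # add r1, r1, r2
    "i2": "ALU",   # add r3, r3, r1
    "i3": "Mem",   # lw  r4, (r1)
    "i4": "ALU",   # add r5, r3, r4
}
ddg_edges = {                  # (source j, destination k) -> d_e
    ("i1", "i2"): 1,
    ("i1", "i3"): 1,
    ("i2", "i4"): 1,
    ("i3", "i4"): 3,           # load result needed: 3-cycle latency
}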

Example of a Weighted Data Dependence Graph

i1: add r1, r1, r2
i2: add r3, r3, r1
i3: lw  r4, (r1)
i4: add r5, r3, r4

Assume: register instruction = 1 cycle, memory instruction = 3 cycles.

[Figure: weighted DDG with ALU nodes i1, i2, i4 and Mem node i3; edges i1 → i2 (1), i1 → i3 (1), i2 → i4 (1), i3 → i4 (3).]

Legal Schedules for Pipeline

Consider a basic block with m instructions:

i1, ..., im

A legal sequence S for the basic block on a pipeline consists of a permutation f on 1...m such that f(j) identifies the new position of instruction j in the basic block. For each DDG edge from j to k, the schedule must satisfy f(j) <= f(k).

Legal Schedules for Pipeline (Con't)

Instruction start-times satisfy the following conditions:

• start-time(j) >= 0 for each instruction j
• no two instructions have the same start-time value
• for each DDG edge from j to k:

    start-time(k) >= completion-time(j)

  where completion-time(j) = start-time(j) + (weight of the edge from j to k)

Legal Schedules for Pipeline (Con't)

We also define:

make_span(S) = completion time of schedule S = MAX { completion-time(j) | 1 <= j <= m }
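Restated compactly in LaTeX (the same definitions as above; d_{jk} denotes the weight of the DDG edge from j to k):

\begin{align*}
&\text{start-time}(j) \ge 0 \quad \text{for each instruction } j\\
&\text{start-time}(j) \ne \text{start-time}(k) \quad \text{for } j \ne k\\
&\text{start-time}(k) \ge \text{start-time}(j) + d_{jk} \quad \text{for each DDG edge } j \to k\\
&\text{make\_span}(S) = \max_{1 \le j \le m} \text{completion-time}(j)
\end{align*}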

Example of Legal Schedules

i1: add r1, r1, r2
i2: add r3, r3, r1
i3: lw  r4, (r1)
i4: add r5, r3, r4

Assume: register instruction = 1 cycle, memory instruction = 3 cycles.

Schedule S1: i1 i2 i3 i4, start-times 0, 1, 2, 5 (2 idle cycles, completion time = 6 cycles)
Schedule S2: i1 i3 i2 i4, start-times 0, 1, 2, 4 (1 idle cycle, completion time = 5 cycles)
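The start-times above can be checked mechanically; the following small Python sketch (my own, assuming a single clean pipeline that issues at most one instruction per cycle) recomputes them:

deps = {"i2": ["i1"], "i3": ["i1"], "i4": ["i2", "i3"]}  # instruction -> DDG predecessors
latency = {"i1": 1, "i2": 1, "i3": 3, "i4": 1}           # register op = 1 cycle, load = 3 cycles

def schedule(order):
    """Issue the instructions in 'order'; return (start-times, makespan)."""
    start, next_issue = {}, 0
    for instr in order:
        ready = max((start[p] + latency[p] for p in deps.get(instr, [])), default=0)
        start[instr] = max(next_issue, ready)   # stall (idle cycles) until operands are ready
        next_issue = start[instr] + 1           # single pipeline: one issue slot per cycle
    makespan = max(start[i] + latency[i] for i in order)
    return start, makespan

print(schedule(["i1", "i2", "i3", "i4"]))  # S1: start-times 0,1,2,5 -> makespan 6
print(schedule(["i1", "i3", "i2", "i4"]))  # S2: start-times 0,1,2,4 -> makespan 5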

Instruction Scheduling (Simplified)

Problem statement:

Given an acyclic weighted data dependence graph G with
• directed edges: precedence
• undirected edges: resource constraints

determine a schedule S such that the length of the schedule is minimized.

[Figure: an example graph with nodes 1-6 and edge weights d12, d13, d23, d24, d26, d34, d35, d45, d46, d56.]

Simplify Resource Constraints

Assume a machine M with n functional units, or a "clean" pipeline with k stages.

What is the complexity of an optimal scheduling algorithm under such constraints?

Scheduling for M is still hard:

• n = 2: there exists a polynomial-time algorithm [Coffman-Graham]
• n = 3: remains open; conjectured NP-hard

General Approaches of Instruction Scheduling

List scheduling
Trace scheduling
Software pipelining
......

Trace Scheduling

A technique for scheduling instructions across basic blocks.

The basic idea of trace scheduling:
• Use information about actual program behavior to select regions for scheduling.

Software Pipelining

A technique for scheduling instructions across loop iterations.

The basic idea of software pipelining:
• Rewrite the loop as a repeating pattern that overlaps instructions from different iterations.

List Scheduling

The most common technique for scheduling instructions within a basic block.

The basic idea of list scheduling:
• All instructions are sorted according to some priority function. Also, maintain a list of instructions that are ready to execute (i.e., their data dependence constraints are satisfied).
• Moving cycle-by-cycle through the schedule template:
  - choose instructions from the list & schedule them (provided that machine resources are available)
  - update the list for the next cycle

List Scheduling (Con't)

• Uses a greedy heuristic approach.
• Has forward and backward forms.
• Is the basis for most algorithms that perform scheduling over regions larger than a single block.

Heuristic Solution: Greedy List Scheduling Algorithm

1. Build a priority list L of the instructions in non-decreasing order of some rank function.
2. For each instruction j, initialize pred-count[j] = # of predecessors of j in the DDG.
3. ready-instructions = { j | pred-count[j] = 0 }
4. While (ready-instructions is non-empty) do
     j = the first ready instruction according to the order in priority list L;
     output j as the next instruction in the schedule;

(Consider resource constraints beyond a single clean pipeline.)

Heuristic Solution: Greedy List Scheduling Algorithm (Con'd)

     ready-instructions = ready-instructions - { j };
     for each successor k of j in the DDG do
       pred-count[k] = pred-count[k] - 1;
       if (pred-count[k] = 0) then
         ready-instructions = ready-instructions + { k };
       end if
     end for
   end while
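Put together, the two slides above amount to the following runnable Python rendering (my own sketch; it models only the precedence constraints, not machine resources):

# Greedy list scheduling over a DDG, following the pseudocode above.
# 'rank' orders the priority list; lower rank = more critical (scheduled first).
def list_schedule(nodes, edges, rank):
    """nodes: instruction names; edges: set of (pred, succ) pairs;
    rank: dict node -> priority, used in non-decreasing order."""
    priority = sorted(nodes, key=lambda n: rank[n])
    pred_count = {n: 0 for n in nodes}
    succs = {n: [] for n in nodes}
    for (j, k) in edges:
        pred_count[k] += 1
        succs[j].append(k)

    ready = [n for n in priority if pred_count[n] == 0]
    schedule = []
    while ready:
        j = ready.pop(0)            # first ready instruction in priority order
        schedule.append(j)
        for k in succs[j]:
            pred_count[k] -= 1
            if pred_count[k] == 0:
                ready.append(k)
        ready.sort(key=lambda n: rank[n])   # keep the ready list in priority order
    return schedule

# Example: the i1..i4 DDG from the earlier slides, with ranks taken from the
# critical-path example later in the deck (rank(i2) = 1, all others 0).
print(list_schedule(
    ["i1", "i2", "i3", "i4"],
    {("i1", "i2"), ("i1", "i3"), ("i2", "i4"), ("i3", "i4")},
    {"i1": 0, "i2": 1, "i3": 0, "i4": 0},
))   # -> ['i1', 'i3', 'i2', 'i4'], i.e., the better schedule S2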

Special Performance Bounds

For a single two-stage pipeline, i.e., m = 1 and k = 2 (here m is the number of pipelines and k is the number of pipeline stages per pipeline):

makespan(greedy) / makespan(OPT) <= 1.5

Properties of List Scheduling

• The result is within a factor of two of the optimal for pipelined machines (Lawler 87).
  (Note: we are considering basic block scheduling here.)

• Complexity: O(n^2), where n is the number of nodes in the DDG.

• In practice, it is dominated by DDG building, which itself is also O(n^2).

A Heuristic Rank Function Based on Critical Paths

1. Compute EST (Earliest Starting Time) for each node in the augmented DDG as follows:

   EST[START] = 0
   EST[y] = MAX { EST[x] + node-weight(x) + edge-weight(x, y) | there exists an edge from x to y }

2. Set CPL = EST[END], the critical path length of the augmented DDG.
3. Similarly, compute LST (Latest Starting Time) for all nodes.
4. Set rank(i) = LST[i] - EST[i] for each instruction i.
   (All instructions on a critical path will have zero rank.)

NOTE: there are other heuristics.
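A short Python sketch (mine) of this computation for the example on the next slide, assuming the two-stage pipeline weights from the Summary slide: node weight 1 per instruction, edge weight 1 only when the source is a load, and weight 0 for the added START/END nodes.

# Critical-path rank: forward pass for EST, backward pass for LST, rank = LST - EST.
node_w = {"START": 0, "i1": 1, "i2": 1, "i3": 1, "i4": 1, "END": 0}
edge_w = {("START", "i1"): 0, ("i1", "i2"): 0, ("i1", "i3"): 0,
          ("i2", "i4"): 0, ("i3", "i4"): 1, ("i4", "END"): 0}
order = ["START", "i1", "i2", "i3", "i4", "END"]          # a topological order

EST = {n: 0 for n in order}
for (x, y), w in sorted(edge_w.items(), key=lambda e: order.index(e[0][0])):
    EST[y] = max(EST[y], EST[x] + node_w[x] + w)

CPL = EST["END"]                                          # critical path length
LST = {n: CPL for n in order}
for (x, y), w in sorted(edge_w.items(), key=lambda e: order.index(e[0][1]), reverse=True):
    LST[x] = min(LST[x], LST[y] - node_w[x] - w)

rank = {n: LST[n] - EST[n] for n in order}
print(EST, LST, rank)   # reproduces the table on the next slide: rank(i2) = 1, all others 0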

Example of Rank Computation

Node x | EST[x] | LST[x] | rank(x)
START  |   0    |   0    |   0
i1     |   0    |   0    |   0
i2     |   1    |   2    |   1
i3     |   1    |   1    |   0
i4     |   3    |   3    |   0
END    |   4    |   4    |   0

==> Priority list = (i1, i3, i4, i2)

[Figure: the augmented DDG START → i1 → {i2, i3} → i4 → END with its node and edge weights.]

Summary

Instruction Scheduling for a Basic Block

1. Build the data dependence graph (DDG) for the basic block:
   • Node = target instruction
   • Edge = data dependence (flow/anti/output)

2. Assign weights to nodes and edges in the DDG so as to model the target processor, e.g., for a two-stage pipeline:
   • Node weight = 1 for all nodes
   • Edge weight = 1 for edges with a load/store instruction as the source node; edge weight = 0 for all other edges

Summary (Con'd)

3. A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG.

4. Goal: find a legal schedule with minimum completion time.

Other Heuristics for Ranking

• Number of successors
• Number of total descendants
• Latency
• Last use of a variable
• Others

Note: these heuristics help break ties, but none dominates the others.

Hazards Preventing Pipelining

• Structural hazards
• Data-dependence hazards
• Control hazards

Local vs. Global Scheduling

1. Straight-line code (basic block) - local scheduling

2. Acyclic control flow - global scheduling
   • Trace scheduling
   • Hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)

3. Loops - one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations.

Smooth Info Flow into the Back End in Open64

[Figure: Open64 back-end flow. WHIRL enters the code generator via WHIRL-to-TOP lowering into CGIR; the CGIR then passes through hyperblock formation / critical-path reduction, inner loop optimization / software pipelining, extended basic block optimization, control flow optimization, and IGLS / GRA / LRA, before code emission produces the executable. Information from the front end (alias information, structures, etc.) flows into the back end alongside.]

Flowchart of Code Generator

[Figure: WHIRL → WHIRL-to-TOP lowering → CGIR quad op list → Control Flow Opt I / EBO → Hyperblock Formation / Critical-Path Reduction → Control Flow Opt II / EBO → process inner loops (unrolling, EBO) → loop prep / software pipelining → IGLS pre-pass → GRA, LRA, EBO → IGLS post-pass → Control Flow Opt → Code Emission. EBO = extended basic block optimization (peephole, etc.); PQS = Predicate Query System.]

Software Pipelining vs. Normal Scheduling

[Figure: decision flow. A loop that is an SWP-amenable candidate (Yes) goes through inner loop processing / software pipelining; on success it proceeds directly to code emission, while on failure or if not profitable it falls back to the normal path of IGLS → GRA/LRA → IGLS, which is also taken when the loop is not SWP-amenable (No).]

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Con't)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (con't)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining & Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Con't)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Con't)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heuristics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 2: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 2

ABET Outcome

Ability to apply knowledge of basic code generation techniques eg Instruction scheduling register allocation to solve code generation problems

An ability to identify formulate and solve loops scheduling problems using software pipelining techniques

Ability to analyze the basic algorithms on the above techniques and conduct experiments to show their effectiveness

Ability to use a modern compiler development platform and tools for the practice of above

A Knowledge on contemporary issues on this topic

1120418 coursecpeg421-10FTopic2appt 3

Reading ListReading ListReading ListReading List

(1) K D Cooper amp L Torczon Engineering a Compiler Chapter 12

(2) Dragon Book Chapter 101 ~ 104

1120418 coursecpeg421-10FTopic2appt 4

A Short Tour on A Short Tour on

Data DependenceData Dependence

A Short Tour on A Short Tour on

Data DependenceData Dependence

1120418 coursecpeg421-10FTopic2appt 5

Basic Concept and MotivationBasic Concept and MotivationBasic Concept and MotivationBasic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

1120418 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if Both statements access the same

memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

1120418 coursecpeg421-10FTopic2appt 7

Types of Types of Data Dependencies

Flow (true) Dependencies - writeread ()x = 4hellipy = x + 1

Output Dependencies - writewrite (o)x = 4 hellipx = y + 1

Anti-dependencies - readwrite (-1)y = x + 1hellipx = 4

-1--

0

1120418 coursecpeg421-10FTopic2appt 8

(1) x = 4(2) y = 6(3) p = x + 2(4) z = y + p(5) x = z(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = zFlow

An Example of An Example of Data Dependencies

OutputAnti

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 3: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 3

Reading ListReading ListReading ListReading List

(1) K D Cooper amp L Torczon Engineering a Compiler Chapter 12

(2) Dragon Book Chapter 101 ~ 104

1120418 coursecpeg421-10FTopic2appt 4

A Short Tour on A Short Tour on

Data DependenceData Dependence

A Short Tour on A Short Tour on

Data DependenceData Dependence

1120418 coursecpeg421-10FTopic2appt 5

Basic Concept and MotivationBasic Concept and MotivationBasic Concept and MotivationBasic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

1120418 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if Both statements access the same

memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

1120418 coursecpeg421-10FTopic2appt 7

Types of Types of Data Dependencies

Flow (true) Dependencies - writeread ()x = 4hellipy = x + 1

Output Dependencies - writewrite (o)x = 4 hellipx = y + 1

Anti-dependencies - readwrite (-1)y = x + 1hellipx = 4

-1--

0

1120418 coursecpeg421-10FTopic2appt 8

(1) x = 4(2) y = 6(3) p = x + 2(4) z = y + p(5) x = z(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = zFlow

An Example of An Example of Data Dependencies

OutputAnti

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs. Global Scheduling

1. Straight-line code (basic block) – local scheduling

2. Acyclic control flow – global scheduling:
   • Trace scheduling
   • Hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)

3. Loops – one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations (see the sketch below).
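As a shape-only illustration of the software-pipelining idea (Python has no instruction latency to hide, and the names are made up for this example), the second version below overlaps the "load" of iteration i+1 with the "use" of iteration i, giving the usual prologue / kernel / epilogue structure.

```python
def plain(a, b, n):
    for i in range(n):
        t = a[i]              # load
        b[i] = t * 2          # use (would stall waiting for the load)

def pipelined(a, b, n):
    if n == 0:
        return
    t = a[0]                  # prologue: first load
    for i in range(n - 1):    # kernel: one overlapped repeating pattern
        nxt = a[i + 1]        # load for the NEXT iteration...
        b[i] = t * 2          # ...overlaps the use for THIS iteration
        t = nxt
    b[n - 1] = t * 2          # epilogue: last use
```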

Smooth Info Flow into the Back End in Open64

(Figure: WHIRL enters the back end and is lowered by WHIRL-to-TOP into CGIR; the CGIR-level phases — hyperblock formation / critical-path reduction, inner-loop optimization / software pipelining, extended-basic-block optimization, control-flow optimization, and IGLS / GRA / LRA — are followed by code emission, which produces the executable. Information from the front end (alias information, structures, etc.) flows into these back-end phases.)

Flowchart of Code Generator

(Figure: WHIRL is lowered (WHIRL-to-TOP) into a CGIR quad op list; the CGIR then goes through Control Flow Opt I / EBO, hyperblock formation and critical-path reduction, processing of inner loops (unrolling, EBO, loop prep, software pipelining), Control Flow Opt II / EBO, the IGLS pre-pass, GRA / LRA / EBO, the IGLS post-pass, and Control Flow Opt, before Code Emission.

EBO = extended-basic-block optimization (peephole, etc.); PQS = Predicate Query System.)

Software Pipelining vs. Normal Scheduling

(Figure: for each loop, if it is a SWP-amenable candidate, it goes to inner-loop processing / software pipelining; on success it proceeds directly to code emission, while on failure (or if software pipelining is not profitable) it falls back to the normal path. Loops that are not SWP-amenable take the normal path: IGLS, then GRA / LRA, then IGLS again, then code emission.)

Page 4: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 4

A Short Tour on A Short Tour on

Data DependenceData Dependence

A Short Tour on A Short Tour on

Data DependenceData Dependence

1120418 coursecpeg421-10FTopic2appt 5

Basic Concept and MotivationBasic Concept and MotivationBasic Concept and MotivationBasic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

1120418 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if Both statements access the same

memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

1120418 coursecpeg421-10FTopic2appt 7

Types of Types of Data Dependencies

Flow (true) Dependencies - writeread ()x = 4hellipy = x + 1

Output Dependencies - writewrite (o)x = 4 hellipx = y + 1

Anti-dependencies - readwrite (-1)y = x + 1hellipx = 4

-1--

0

1120418 coursecpeg421-10FTopic2appt 8

(1) x = 4(2) y = 6(3) p = x + 2(4) z = y + p(5) x = z(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = zFlow

An Example of An Example of Data Dependencies

OutputAnti

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 5: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 5

Basic Concept and MotivationBasic Concept and MotivationBasic Concept and MotivationBasic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

1120418 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if Both statements access the same

memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

1120418 coursecpeg421-10FTopic2appt 7

Types of Types of Data Dependencies

Flow (true) Dependencies - writeread ()x = 4hellipy = x + 1

Output Dependencies - writewrite (o)x = 4 hellipx = y + 1

Anti-dependencies - readwrite (-1)y = x + 1hellipx = 4

-1--

0

1120418 coursecpeg421-10FTopic2appt 8

(1) x = 4(2) y = 6(3) p = x + 2(4) z = y + p(5) x = z(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = zFlow

An Example of An Example of Data Dependencies

OutputAnti

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 6: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if Both statements access the same

memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

1120418 coursecpeg421-10FTopic2appt 7

Types of Types of Data Dependencies

Flow (true) Dependencies - writeread ()x = 4hellipy = x + 1

Output Dependencies - writewrite (o)x = 4 hellipx = y + 1

Anti-dependencies - readwrite (-1)y = x + 1hellipx = 4

-1--

0

1120418 coursecpeg421-10FTopic2appt 8

(1) x = 4(2) y = 6(3) p = x + 2(4) z = y + p(5) x = z(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = zFlow

An Example of An Example of Data Dependencies

OutputAnti

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

Local vs. Global Scheduling

1. Straight-line code (basic block) - local scheduling

2. Acyclic control flow - global scheduling:
   • Trace scheduling
   • Hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)

3. Loops - one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations (a toy illustration of this pattern follows below).
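To make the "repeating pattern" idea concrete, here is a toy Python sketch (an illustration only, not Open64's modulo scheduler) that prints the prologue / steady-state kernel / epilogue overlap for a loop body split into three stages, starting one new iteration per cycle:

```python
def software_pipeline_pattern(n_iters, stages=("load", "mul", "store")):
    """Yield, cycle by cycle, which stage of which iteration is issued
    when a new iteration starts every cycle (initiation interval = 1)."""
    depth = len(stages)
    for cycle in range(n_iters + depth - 1):
        issued = []
        for s, name in enumerate(stages):
            it = cycle - s                 # iteration whose stage s runs now
            if 0 <= it < n_iters:
                issued.append(f"{name}[{it}]")
        yield cycle, issued

for cycle, ops in software_pipeline_pattern(4):
    print(cycle, ops)
# Cycles 0-1 are the prologue, cycles 2-3 the steady-state kernel
# (one stage from each of three consecutive iterations), cycles 4-5 the epilogue.
```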

Smooth Info Flow into the Back End in Open64

[Figure: Open64 back-end overview. WHIRL enters the code generator and is lowered (WHIRL-to-TOP) into CGIR; the CGIR then goes through hyperblock formation / critical-path reduction, inner-loop optimization / software pipelining, extended-basic-block optimization, control-flow optimization, and IGLS with GRA/LRA, before code emission produces the executable. Information from the front end (alias information, structures, etc.) flows into all of these back-end phases.]

Flowchart of Code Generator

[Figure: code-generator flowchart. WHIRL -> WHIRL-to-TOP lowering -> CGIR (quad op list) -> Control Flow Opt I / EBO -> Hyperblock Formation / Critical-Path Reduction (using PQS, the Predicate Query System) -> Control Flow Opt II / EBO -> process inner loops (unrolling, EBO) -> loop prep / software pipelining -> IGLS pre-pass -> GRA, LRA, EBO -> IGLS post-pass -> Control Flow Opt -> Code Emission. EBO = extended-basic-block optimization (peephole, etc.).]

Software Pipelining vs. Normal Scheduling

[Figure: decision flow for each inner loop. If the loop is a SWP-amenable candidate, it goes through inner-loop processing / software pipelining; on success it proceeds directly to code emission, while on failure (or if SWP is not profitable) it falls back to the normal path of IGLS, GRA/LRA, and IGLS again before code emission. Loops that are not SWP candidates take the normal path from the start.]

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Con't)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (con't)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining & Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Con't)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Con't)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heuristics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 7: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 7

Types of Types of Data Dependencies

Flow (true) Dependencies - writeread ()x = 4hellipy = x + 1

Output Dependencies - writewrite (o)x = 4 hellipx = y + 1

Anti-dependencies - readwrite (-1)y = x + 1hellipx = 4

-1--

0

1120418 coursecpeg421-10FTopic2appt 8

(1) x = 4(2) y = 6(3) p = x + 2(4) z = y + p(5) x = z(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = zFlow

An Example of An Example of Data Dependencies

OutputAnti

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 8: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 8

(1) x = 4(2) y = 6(3) p = x + 2(4) z = y + p(5) x = z(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = zFlow

An Example of An Example of Data Dependencies

OutputAnti

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 9: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements nodes = statements edges = dependence relation (type label)

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

Page 10: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 11: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 12: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 12

Instruction Scheduling Loop restructuring Exploiting Parallelism

Analyze array references to determine whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 13: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 13

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx Sy flow dependence

S1

S2

S3

S4

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF: instruction fetch
RF: decode and register fetch
EX: execute on ALU
ME: memory access
WB: write back to register file

[Figure: the five stages laid out along the time axis.]

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

[Figure: pipelined execution. Successive instructions enter the IF-RF-EX-ME-WB pipeline one cycle apart; in a given cycle each instruction is in a different stage, but every stage is active. The pipeline is "full" here.]

[Figure: the standard non-pipelined model, in which each instruction finishes all five stages before the next one starts.]

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example:
  i1: add r1, r1, r2
  i2: add r3, r3, r1
  i3: lw  r4, 0(r1)
  i4: add r5, r3, r4

Assume: register instructions take 1 cycle, memory instructions take 3 cycles.

Consider two possible instruction schedules (permutations):

  Schedule S1: i1, i2, i3, i4   (completion time = 6 cycles, 2 idle cycles)
  Schedule S2: i1, i3, i2, i4   (completion time = 5 cycles, 1 idle cycle)

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

[Figure: two pipelines running side by side, so that multiple instructions occupy the same pipeline stage at the same time.]

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints: if instruction A computes a value that is read by instruction B, then B can't execute before A is completed.

Resource hazards: the finite number of hardware function units means limited parallelism.

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

• Hardware Resources
  - finite set of FUs, with instruction type, width, and latency constraints
  - the cache hierarchy also imposes many constraints
• Data Dependences
  - can't consume a result before it is produced
  - ambiguous dependences create many challenges
• Control Dependences
  - impractical to schedule for all possible paths
  - choosing an "expected" path may be difficult
  - recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question: when must we preserve the order of two instructions i and j?

Answer: when there is a dependence from i to j.

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

• Each DDG node is labeled with a resource-reservation table, namely the resource-reservation table associated with the operation type of that node.

• Each edge e from node j to node k is labeled with a weight (latency or delay) de, indicating that the destination node k must be issued no earlier than de cycles after the source node j is issued.

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

  i1: add r1, r1, r2    (ALU)
  i2: add r3, r3, r1    (ALU)
  i3: lw  r4, (r1)      (Mem)
  i4: add r5, r3, r4    (ALU)

Assume: register instructions take 1 cycle, memory instructions take 3 cycles.

[Figure: DDG with edges i1→i2 (delay 1), i1→i3 (delay 1), i2→i4 (delay 1), and i3→i4 (delay 3); each node is labeled with its resource class (ALU or Mem).]
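In code, the weighted DDG above can be written down directly as an adjacency map. The sketch below is a minimal illustration (the names and representation are assumptions), with each edge carrying the latency of its source instruction.

```python
# Minimal sketch of the slide's weighted DDG (representation is assumed).
# Node label = functional-unit class; edge weight = delay in cycles before
# the destination may issue after the source issues.
nodes = {"i1": "ALU", "i2": "ALU", "i3": "Mem", "i4": "ALU"}
edges = {
    ("i1", "i2"): 1,   # i2 reads r1 produced by i1 (ALU result: 1 cycle)
    ("i1", "i3"): 1,   # i3 uses r1 as its load address
    ("i2", "i4"): 1,   # i4 reads r3 produced by i2
    ("i3", "i4"): 3,   # i4 reads r4 loaded by i3 (memory result: 3 cycles)
}
```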

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions: i1, …, im.

A legal sequence S for the basic block on a pipeline consists of a permutation f on 1…m such that f(j) identifies the new position of instruction j in the basic block. For each DDG edge from j to k, the schedule must satisfy f(j) ≤ f(k).

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Con't)

Instruction start-time: an assignment of start-times satisfies the following conditions:

• start-time(j) ≥ 0 for each instruction j
• no two instructions have the same start-time value
• for each DDG edge from j to k:
      start-time(k) ≥ completion-time(j)
  where completion-time(j) = start-time(j) + (weight of the edge from j to k)

1120418 coursecpeg421-10FTopic2appt 34

Legal Schedules for Pipeline (Con't)

We also define:

    make_span(S) = completion time of schedule S
                 = MAX over 1 ≤ j ≤ m of completion-time(j)

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

  i1: add r1, r1, r2
  i2: add r3, r3, r1
  i3: lw  r4, (r1)
  i4: add r5, r3, r4

Assume: register instructions take 1 cycle, memory instructions take 3 cycles.

  Schedule S1: i1, i2, i3, i4   start-times 0, 1, 2, 5   (completion time = 6 cycles, 2 idle cycles)
  Schedule S2: i1, i3, i2, i4   start-times 0, 1, 2, 4   (completion time = 5 cycles, 1 idle cycle)
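These numbers can be checked mechanically. The following Python sketch (an illustration under the stated latency assumptions; all names here are made up) computes the earliest legal start-times for a given instruction order on a single-issue pipeline and the resulting makespan; it reproduces 6 cycles for S1 and 5 cycles for S2.

```python
# Illustrative sketch: earliest start-times for a given instruction order
# on a single-issue pipeline, honoring the weighted-DDG edge delays.
lat   = {"i1": 1, "i2": 1, "i3": 3, "i4": 1}          # issue-to-result latency
edges = {("i1", "i2"): 1, ("i1", "i3"): 1,
         ("i2", "i4"): 1, ("i3", "i4"): 3}

def makespan(order):
    start, next_issue = {}, 0
    for j in order:
        # j may issue once every predecessor's result is available
        ready = max([start[p] + d for (p, q), d in edges.items() if q == j],
                    default=0)
        start[j] = max(next_issue, ready)             # stalls insert idle cycles
        next_issue = start[j] + 1                     # one issue slot per cycle
    return start, max(start[j] + lat[j] for j in order)

print(makespan(["i1", "i2", "i3", "i4"]))   # start-times 0,1,2,5 -> 6 cycles (S1)
print(makespan(["i1", "i3", "i2", "i4"]))   # start-times 0,1,2,4 -> 5 cycles (S2)
```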

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling (Simplified)

Problem statement: given an acyclic weighted data dependence graph G with
• directed edges: precedence
• undirected edges: resource constraints,

determine a schedule S such that the length of the schedule is minimized.

[Figure: an example DDG with nodes 1-6 and edge delays d12, d13, d23, d24, d26, d34, d35, d45, d46, d56.]

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units, or a "clean" pipeline with k stages.

What is the complexity of an optimal scheduling algorithm under such constraints?

Scheduling for M is still hard:

• n = 2: a polynomial-time algorithm exists [Coffman & Graham]
• n = 3: remains open; conjectured NP-hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches to Instruction Scheduling

• List scheduling
• Trace scheduling
• Software pipelining
• …

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks.

The basic idea of trace scheduling: use information about actual program behavior to select regions for scheduling.

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations.

The basic idea of software pipelining: rewrite the loop as a repeating pattern that overlaps instructions from different iterations.

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

The most common technique for scheduling instructions within a basic block.

The basic idea of list scheduling:
• All instructions are sorted according to some priority function; also maintain a list of instructions that are ready to execute (i.e., their data dependence constraints are satisfied).
• Moving cycle-by-cycle through the schedule template:
  - choose instructions from the list and schedule them (provided that machine resources are available)
  - update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

• Uses a greedy heuristic approach
• Has forward and backward forms
• Is the basis for most algorithms that perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic Solution: Greedy List Scheduling Algorithm

1. Build a priority list L of the instructions in non-decreasing order of some rank function.

2. For each instruction j, initialize pred-count[j] = number of predecessors of j in the DDG.

3. ready-instructions = { j | pred-count[j] == 0 }

4. While (ready-instructions is non-empty) do
       j = first ready instruction according to the order in priority list L;
       output j as the next instruction in the schedule;

(Extension to consider: resource constraints beyond a single clean pipeline.)

1120418 coursecpeg421-10FTopic2appt 44

Heuristic Solution: Greedy List Scheduling Algorithm (Con'd)

       ready-instructions = ready-instructions - { j };
       for each successor k of j in the DDG do
           pred-count[k] = pred-count[k] - 1;
           if (pred-count[k] == 0) then
               ready-instructions = ready-instructions + { k };
           end if
       end for
   end while
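Putting the two halves of the algorithm together, here is a compact Python sketch of the greedy list-scheduling loop for a single clean pipeline (illustrative only: the names are assumptions, and the per-cycle resource check hinted at on the slide is omitted for brevity).

```python
# Illustrative sketch of greedy list scheduling (resource checks omitted).
def list_schedule(nodes, edges, rank):
    """nodes: instruction ids; edges: dict (src, dst) -> delay;
    rank: dict node -> priority (lower rank is scheduled earlier)."""
    succs = {n: [] for n in nodes}
    pred_count = {n: 0 for n in nodes}
    for (j, k) in edges:
        succs[j].append(k)
        pred_count[k] += 1

    priority = sorted(nodes, key=lambda n: rank[n])    # non-decreasing rank
    position = {n: i for i, n in enumerate(priority)}
    ready = [n for n in nodes if pred_count[n] == 0]
    schedule = []
    while ready:
        j = min(ready, key=position.get)               # first ready in list order
        ready.remove(j)
        schedule.append(j)
        for k in succs[j]:
            pred_count[k] -= 1
            if pred_count[k] == 0:
                ready.append(k)
    return schedule

edges = {("i1", "i2"): 1, ("i1", "i3"): 1, ("i2", "i4"): 1, ("i3", "i4"): 3}
rank  = {"i1": 0, "i2": 1, "i3": 0, "i4": 0}           # ranks as in the later rank example
print(list_schedule(["i1", "i2", "i3", "i4"], edges, rank))  # ['i1', 'i3', 'i2', 'i4']
```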

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two-stage pipeline, i.e., m = 1 and k = 2 (here m is the number of pipelines and k is the number of pipeline stages per pipeline):

    makespan(greedy) / makespan(OPT) ≤ 1.5

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

• The result is within a factor of two of the optimal for pipelined machines (Lawler 87).
  (Note: we are considering basic block scheduling here.)

• Complexity: O(n^2), where n is the number of nodes in the DDG.

• In practice, the running time is dominated by building the DDG, which is itself O(n^2).

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function Based on Critical Paths

1. Compute EST (earliest starting time) for each node in the augmented DDG as follows:
       EST[START] = 0
       EST[y] = MAX { EST[x] + node-weight(x) + edge-weight(x, y) | there exists an edge from x to y }

2. Set CPL = EST[END], the critical path length of the augmented DDG.

3. Similarly, compute LST (latest starting time) for all nodes.

4. Set rank(i) = LST[i] - EST[i] for each instruction i.
   (All instructions on a critical path will have zero rank.)

NOTE: there are other heuristics.
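As an illustration of steps 1-4, the sketch below (assumed names and representation, not from the slides) computes EST, LST, and rank on the augmented DDG of the running example, using the two-stage-pipeline weights described in the summary later on (node weight 1 per instruction, edge weight 1 when the source is a load); it reproduces the table on the next slide.

```python
# Illustrative sketch: EST/LST/rank on the augmented DDG of the running
# example (START/END added), with node weight 1 per instruction and edge
# weight 1 only on the edge leaving the load i3.  Names are assumptions.
node_w = {"START": 0, "i1": 1, "i2": 1, "i3": 1, "i4": 1, "END": 0}
edges  = {("START", "i1"): 0, ("i1", "i2"): 0, ("i1", "i3"): 0,
          ("i2", "i4"): 0, ("i3", "i4"): 1, ("i4", "END"): 0}
topo   = ["START", "i1", "i2", "i3", "i4", "END"]      # a topological order
pos    = {n: i for i, n in enumerate(topo)}

EST = {n: 0 for n in topo}
for (x, y), ew in sorted(edges.items(), key=lambda e: pos[e[0][0]]):
    EST[y] = max(EST[y], EST[x] + node_w[x] + ew)       # forward pass

CPL = EST["END"]                                        # critical path length
LST = {n: CPL for n in topo}
for (x, y), ew in sorted(edges.items(), key=lambda e: -pos[e[0][1]]):
    LST[x] = min(LST[x], LST[y] - node_w[x] - ew)       # backward pass

rank = {n: LST[n] - EST[n] for n in topo}
print(rank)   # i2 has rank 1; every node on the critical path has rank 0
```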

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

  Node x    EST[x]   LST[x]   rank(x)
  START       0        0         0
  i1          0        0         0
  i2          1        2         1
  i3          1        1         0
  i4          3        3         0
  END         4        4         0

==> Priority list = (i1, i3, i4, i2)

[Figure: the augmented DDG, i.e., the example DDG (i1→i2, i1→i3, i2→i4, i3→i4) with START and END nodes added, annotated with node and edge weights.]

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1. Build the data dependence graph (DDG) for the basic block:
   • node = target instruction
   • edge = data dependence (flow/anti/output)

2. Assign weights to nodes and edges in the DDG so as to model the target processor; e.g., for a two-stage pipeline:
   • node weight = 1 for all nodes
   • edge weight = 1 for edges with a load/store instruction as the source node; edge weight = 0 for all other edges
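A minimal sketch of step 2 for this two-stage pipeline model (the opcode names are assumptions):

```python
# Minimal sketch of the two-stage-pipeline weight model described above.
def node_weight(opcode):
    return 1                      # every instruction occupies one issue slot

def edge_weight(src_opcode):
    # a dependent instruction must wait an extra cycle behind a load/store
    return 1 if src_opcode in ("lw", "sw") else 0
```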

1120418 coursecpeg421-10FTopic2appt 50

Summary (Con'd)

3. A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG.

4. Goal: find a legal schedule with minimum completion time.

1120418 coursecpeg421-10FTopic2appt 51

Other Heuristics for Ranking

• Number of successors
• Number of total descendants
• Latency
• Last use of a variable
• Others

Note: these heuristics help break ties, but none dominates the others.

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

• Structural hazards

• Data (dependence) hazards

• Control hazards

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1. Straight-line code (basic block): local scheduling

2. Acyclic control flow: global scheduling
   • trace scheduling
   • hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)

3. Loops: one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations.

1120418 coursecpeg421-10FTopic2appt 54

Smooth Information Flow into the Backend in Open64

[Figure: Open64 back-end flow. WHIRL enters the code generator through WHIRL-to-TOP lowering into CGIR; the phases shown include hyperblock formation and critical-path reduction, inner-loop optimization and software pipelining, extended basic block optimization, control flow optimization, IGLS with GRA/LRA, and finally code emission producing the executable. Information from the front end (alias information, structures, etc.) flows alongside into the backend.]

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

[Figure: WHIRL → WHIRL-to-TOP lowering → CGIR (quad op list) → Control Flow Opt I, EBO → Hyperblock Formation, Critical-Path Reduction → Process Inner Loops (unrolling, EBO) → Loop prep, software pipelining → Control Flow Opt II, EBO → IGLS pre-pass → GRA, LRA, EBO → IGLS post-pass → Control Flow Opt → Code Emission.
EBO = extended basic block optimization (peephole, etc.); PQS = predicate query system.]

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs. Normal Scheduling

[Figure: decision flow in the code generator. A loop that is an SWP-amenable candidate goes through inner-loop processing / software pipelining; on success it proceeds directly to code emission, while on failure (or when not profitable), and for non-candidate code, it takes the normal path of IGLS, GRA/LRA, IGLS, and then code emission.]

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 14: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0S2 B = AS3 A = B + 1S4 C = A

S1

S2

S3

S4

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 15: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 15

Should we consider input Should we consider input dependencedependence

Should we consider input Should we consider input dependencedependence

= X

= X

Is the reading of the same X important

Well it may be(if we intend to group the 2 reads together for cache optimization)

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 16: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 16

- register allocation- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization

Applications of Data Applications of Data Dependence GraphDependence Graph

Applications of Data Applications of Data Dependence GraphDependence Graph

1120418 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 -1 s3 s2 s3

(s3) a = x - 2

(s4) end do s3 s2 (next iteration)

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two-stage pipeline, i.e., m = 1 and k = 2 (where m is the number of pipelines and k is the number of pipeline stages per pipeline):

    makespan(greedy) / makespan(OPT) <= 1.5


Properties of List Scheduling

• The result is within a factor of two of optimal for pipelined machines (Lawler87). (Note: we are considering basic-block scheduling here.)
• Complexity: O(n^2), where n is the number of nodes in the DDG.
• In practice, the running time is dominated by building the DDG, which is itself O(n^2).


A Heuristic Rank Function Based on Critical Paths

1. Compute EST (Earliest Starting Time) for each node in the augmented DDG as follows:
       EST[START] = 0
       EST[y] = MAX { EST[x] + node-weight(x) + edge-weight(x,y) | there exists an edge from x to y }
2. Set CPL = EST[END], the critical path length of the augmented DDG.
3. Similarly, compute LST (Latest Starting Time) for all nodes.
4. Set rank(i) = LST[i] - EST[i] for each instruction i.
   (All instructions on a critical path will have zero rank.)

NOTE: there are other heuristics.


Example of Rank Computation

    Node x   EST[x]   LST[x]   rank(x)
    START      0        0         0
    i1         0        0         0
    i2         1        2         1
    i3         1        1         0
    i4         3        3         0
    END        4        4         0

==> Priority list = (i1, i3, i4, i2)

[Figure: the augmented DDG, with START and END nodes (weight 0) added around i1-i4; node weights are 1 and edge weights are 0, except for weight 1 on the edge from the load i3 to i4.]
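The table can be reproduced mechanically. The sketch below is an illustration: the node and edge weights follow the two-stage-pipeline model summarized on the next slides, with START and END given weight 0, and the variable names are assumptions of the sketch.

    # Compute EST, LST and rank for the augmented DDG of the running example.
    node_w = {"START": 0, "i1": 1, "i2": 1, "i3": 1, "i4": 1, "END": 0}
    edges = {                                   # (source, destination): edge weight
        ("START", "i1"): 0,
        ("i1", "i2"): 0, ("i1", "i3"): 0,
        ("i2", "i4"): 0, ("i3", "i4"): 1,       # i3 is the load
        ("i4", "END"): 0,
    }
    order = ["START", "i1", "i2", "i3", "i4", "END"]    # a topological order

    est = {n: 0 for n in order}
    for y in order:                             # forward pass: earliest start times
        preds = [(x, w) for (x, yy), w in edges.items() if yy == y]
        if preds:
            est[y] = max(est[x] + node_w[x] + w for x, w in preds)

    cpl = est["END"]                            # critical path length
    lst = {n: cpl for n in order}
    for x in reversed(order):                   # backward pass: latest start times
        succs = [(y, w) for (xx, y), w in edges.items() if xx == x]
        if succs:
            lst[x] = min(lst[y] - node_w[x] - w for y, w in succs)

    for n in order:                             # node, EST, LST, rank = LST - EST
        print(n, est[n], lst[n], lst[n] - est[n])

Sorting the real instructions by rank in non-decreasing order yields the priority list (i1, i3, i4, i2) shown above.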


Summary: Instruction Scheduling for a Basic Block

1. Build the data dependence graph (DDG) for the basic block:
   • Node = target instruction
   • Edge = data dependence (flow/anti/output)
2. Assign weights to nodes and edges in the DDG so as to model the target processor, e.g., for a two-stage pipeline:
   • Node weight = 1 for all nodes
   • Edge weight = 1 for edges with a load/store instruction as the source node; edge weight = 0 for all other edges


Summary (cont'd)

3. A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG.
4. Goal: find a legal schedule with minimum completion time.
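As a concrete (and deliberately simplified) reading of steps 1-2, the sketch below builds only the flow-dependence edges for the running four-instruction block and weights them with the two-stage-pipeline rule. Anti- and output dependences, memory disambiguation, and real opcode handling are omitted, and the instruction encoding and the names build_flow_ddg and MEMORY_OPS are assumptions of the sketch.

    # Build flow-dependence edges for a straight-line block and weight them with the
    # two-stage-pipeline model: node weight 1; edge weight 1 when the source
    # instruction is a load/store, 0 otherwise.
    block = [                                   # (name, opcode, dest, sources)
        ("i1", "add", "r1", ["r1", "r2"]),
        ("i2", "add", "r3", ["r3", "r1"]),
        ("i3", "lw",  "r4", ["r1"]),
        ("i4", "add", "r5", ["r3", "r4"]),
    ]
    MEMORY_OPS = {"lw", "sw"}

    def build_flow_ddg(block):
        opcode = {name: op for name, op, _, _ in block}
        node_weight = {name: 1 for name, _, _, _ in block}
        edges, last_writer = {}, {}             # last_writer: register -> defining instr
        for name, op, dest, srcs in block:
            for r in srcs:                      # read-after-write: flow dependence
                if r in last_writer:
                    src = last_writer[r]
                    edges[(src, name)] = 1 if opcode[src] in MEMORY_OPS else 0
            last_writer[dest] = name            # this instruction now defines `dest`
        return node_weight, edges

    node_weight, edges = build_flow_ddg(block)
    print(edges)  # {('i1', 'i2'): 0, ('i1', 'i3'): 0, ('i2', 'i4'): 0, ('i3', 'i4'): 1}

A full implementation would also record anti- and output dependences, for example by tracking the last readers of each register and the previous writer of each register, respectively.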


Other Heuristics for Ranking

• Number of successors
• Number of total descendants
• Latency
• Last use of a variable
• Others

Note: these heuristics help break ties, but none dominates the others.


Hazards Preventing Pipelining

• Structural hazards
• Data (dependence) hazards
• Control hazards


Local vs. Global Scheduling

1. Straight-line code (basic block): local scheduling.
2. Acyclic control flow: global scheduling
   • Trace scheduling
   • Hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)
3. Loops: one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations.


Smooth Info Flow into the Back End in Open64

[Figure: information from the front end (alias information, structures, etc.) flows into the Open64 back end through WHIRL, WHIRL-to-TOP lowering, CGIR, hyperblock formation / critical-path reduction, inner loop optimization / software pipelining, extended basic block optimization, control flow optimization, IGLS / GRA / LRA, and code emission, producing the executable.]


Flowchart of the Code Generator

[Figure: code generator flowchart. Stages include: WHIRL; WHIRL-to-TOP lowering; CGIR (quad op list); Control Flow Opt I + EBO; Hyperblock Formation / Critical-Path Reduction (with the PQS, Predicate Query System); processing of inner loops (unrolling, EBO, loop prep, software pipelining); Control Flow Opt II + EBO; IGLS pre-pass; GRA, LRA, EBO; IGLS post-pass; Control Flow Opt; Code Emission. EBO = extended basic block optimization (peephole, etc.).]


Software Pipelining vs. Normal Scheduling

[Figure: decision flow for a loop. If it is an SWP-amenable loop candidate (Yes), it goes through inner loop processing / software pipelining; on success it proceeds to code emission, while on failure or if not profitable it falls back to the normal path. Non-candidates (No) go through IGLS, GRA/LRA, and IGLS again before code emission.]

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 18: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units Instruction

scheduling can improve the performance of a

program by placing independent target

instructions in parallel or adjacent positions

Motivation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 19: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we have finished optimizing the IR

bull Instruction scheduling attempts to reorder the codes for maximum instruction-level parallelism (ILP)

bull It is one of the instruction-level optimizationsbull Instruction scheduling (IS) in general is NP-complete so heuristics must be used

Instruction Schedular

Original Code Reordered Code

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 20: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 20

Instruction schedulingA Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can execute them in parallel assuming adequate hardware processing resources

time

a = 1 + x b = 2 + y c = 3 + z

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Con't)

Instruction start-times: an assignment of start-times satisfies the following conditions:

• start-time(j) >= 0 for each instruction j
• no two instructions have the same start-time value
• for each DDG edge from j to k:

start-time(k) >= completion-time(j)

where

completion-time(j) = start-time(j) + (weight of the edge between j and k)

1120418 coursecpeg421-10FTopic2appt 34

Legal Schedules for Pipeline (Con't)

We also define:

makespan(S) = completion time of schedule S
            = MAX { completion-time(j) | 1 <= j <= m }
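These conditions translate directly into a small legality check. The Python sketch below is illustrative (the function names and the dictionary encoding of edges and start times are assumptions); it uses the running i1..i4 example and the start times shown on the next slide:

def is_legal(start, edges):
    """start: {instr: start cycle}; edges: {(j, k): weight d_jk}."""
    if any(t < 0 for t in start.values()):                 # start-time(j) >= 0
        return False
    if len(set(start.values())) != len(start):             # distinct start times
        return False
    return all(start[k] >= start[j] + d                    # start-time(k) >= completion-time(j)
               for (j, k), d in edges.items())

def makespan(start, latency):
    # completion-time(j) = start-time(j) + latency(j); in this model the weight of an
    # edge leaving j equals j's latency, so this matches the definition above.
    return max(t + latency[j] for j, t in start.items())

EDGES   = {("i1","i2"): 1, ("i1","i3"): 1, ("i2","i4"): 1, ("i3","i4"): 3}
LATENCY = {"i1": 1, "i2": 1, "i3": 3, "i4": 1}
S1 = {"i1": 0, "i2": 1, "i3": 2, "i4": 5}     # order i1 i2 i3 i4 (next slide)
S2 = {"i1": 0, "i3": 1, "i2": 2, "i4": 4}     # order i1 i3 i2 i4
print(is_legal(S1, EDGES), makespan(S1, LATENCY))   # True 6
print(is_legal(S2, EDGES), makespan(S2, LATENCY))   # True 5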

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Assume: register instruction latency 1 cycle, memory instruction latency 3 cycles.

i1: add r1, r1, r2
i2: add r3, r3, r1
i3: lw  r4, 0(r1)
i4: add r5, r3, r4

Schedule S1: i1 i2 i3 i4, start-times 0, 1, 2, 5  (completion time = 6 cycles, 2 idle cycles)
Schedule S2: i1 i3 i2 i4, start-times 0, 1, 2, 4  (completion time = 5 cycles, 1 idle cycle)

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling (Simplified)

Problem Statement: given an acyclic weighted data dependence graph G with
• directed edges: precedence
• undirected edges: resource constraints
determine a schedule S such that the length of the schedule is minimized.

[Example graph: nodes 1-6 with edge delays d12, d13, d23, d24, d26, d34, d35, d45, d46, d56.]

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a "clean" pipeline with k stages.

What is the complexity of an optimal scheduling algorithm under such constraints?

Scheduling for M is still hard:

• n = 2: there exists a polynomial-time algorithm [Coffman & Graham]
• n = 3: remains open; conjectured to be NP-hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches of Instruction Scheduling

• List scheduling
• Trace scheduling
• Software pipelining
• …

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

• Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

• Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

The most common technique for scheduling instructions within a basic block.

The basic idea of list scheduling:
• All instructions are sorted according to some priority function; also maintain a list of instructions that are ready to execute (i.e., their data dependence constraints are satisfied).
• Moving cycle-by-cycle through the schedule template:
  - choose instructions from the list and schedule them (provided that machine resources are available)
  - update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Con't)

• Uses a greedy heuristic approach
• Has forward and backward forms
• Is the basis for most algorithms that perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic Solution: Greedy List Scheduling Algorithm

1. Build a priority list L of the instructions in non-decreasing order of some rank function.

2. For each instruction j, initialize pred-count[j] = the number of predecessors of j in the DDG.

3. ready-instructions = { j | pred-count[j] = 0 }

4. While (ready-instructions is non-empty) do
     j = first ready instruction according to the order in priority list L
     Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline.

1120418 coursecpeg421-10FTopic2appt 44

Heuristic Solution: Greedy List Scheduling Algorithm (Con'd)

     ready-instructions = ready-instructions - { j }
     for each successor k of j in the DDG do
       pred-count[k] = pred-count[k] - 1
       if (pred-count[k] = 0) then
         ready-instructions = ready-instructions + { k }
       end if
     end for
   end while
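A runnable Python sketch of the greedy list scheduling algorithm above (the data encoding and the plain-list ready set are illustrative choices; resource constraints beyond a single clean pipeline are not modeled):

def greedy_list_schedule(nodes, edges, rank):
    """nodes: instruction names; edges: {(j, k): weight}; rank: lower value = higher priority."""
    priority   = sorted(nodes, key=lambda j: rank[j])        # step 1: priority list L
    pred_count = {j: 0 for j in nodes}                       # step 2
    succs      = {j: [] for j in nodes}
    for (j, k) in edges:
        pred_count[k] += 1
        succs[j].append(k)
    ready = [j for j in nodes if pred_count[j] == 0]         # step 3
    order = []
    while ready:                                             # step 4
        j = min(ready, key=priority.index)                   # first ready instruction in L
        order.append(j)                                      # output j
        ready.remove(j)
        for k in succs[j]:
            pred_count[k] -= 1
            if pred_count[k] == 0:
                ready.append(k)
    return order

EDGES = {("i1","i2"): 1, ("i1","i3"): 1, ("i2","i4"): 1, ("i3","i4"): 3}
RANK  = {"i1": 0, "i2": 1, "i3": 0, "i4": 0}   # critical-path ranks (see the example below)
print(greedy_list_schedule(["i1","i2","i3","i4"], EDGES, RANK))
# ['i1', 'i3', 'i2', 'i4']  -- the 5-cycle schedule S2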

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two-stage pipeline (m = 1 and k = 2, where m is the number of pipelines and k is the number of pipeline stages per pipeline):

makespan(greedy) / makespan(OPT) <= 1.5

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

• The result is within a factor of two of the optimal for pipelined machines (Lawler '87).

Note: we are considering basic block scheduling here.

• Complexity: O(n^2), where n is the number of nodes in the DDG.

• In practice, it is dominated by DDG building, which itself is also O(n^2).

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1. Compute EST (Earliest Starting Time) for each node in the augmented DDG as follows:

   EST[START] = 0
   EST[y] = MAX { EST[x] + node-weight(x) + edge-weight(x,y) | there exists an edge from x to y }

2. Set CPL = EST[END], the critical path length of the augmented DDG.

3. Similarly, compute LST (Latest Starting Time) for all nodes.

4. Set rank(i) = LST[i] - EST[i] for each instruction i.
   (All instructions on a critical path will have zero rank.)

NOTE: there are other heuristics.
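An illustrative Python sketch of steps 1-4 (assumptions: explicit START/END nodes, nodes supplied in topological order, and the two-stage pipeline weights from the summary slides, i.e. node weight 1 per instruction and edge weight 1 when the source is a load); it reproduces the table in the example that follows:

def critical_path_rank(nodes, edges, node_w):
    """nodes: topological order, START first and END last; edges: {(x, y): edge weight}."""
    preds, succs = {x: [] for x in nodes}, {x: [] for x in nodes}
    for (x, y) in edges:
        preds[y].append(x)
        succs[x].append(y)

    est = {x: 0 for x in nodes}                        # step 1: forward pass
    for y in nodes:
        for x in preds[y]:
            est[y] = max(est[y], est[x] + node_w[x] + edges[(x, y)])

    cpl = est["END"]                                   # step 2: critical path length
    lst = {x: cpl for x in nodes}                      # step 3: backward pass
    for x in reversed(nodes):
        for y in succs[x]:
            lst[x] = min(lst[x], lst[y] - node_w[x] - edges[(x, y)])

    rank = {x: lst[x] - est[x] for x in nodes}         # step 4
    return est, lst, rank

NODES  = ["START", "i1", "i2", "i3", "i4", "END"]
NODE_W = {"START": 0, "i1": 1, "i2": 1, "i3": 1, "i4": 1, "END": 0}
EDGES  = {("START","i1"): 0, ("i1","i2"): 0, ("i1","i3"): 0,
          ("i2","i4"): 0, ("i3","i4"): 1, ("i4","END"): 0}   # weight 1: load i3 feeds i4
est, lst, rank = critical_path_rank(NODES, EDGES, NODE_W)
print(rank)   # {'START': 0, 'i1': 0, 'i2': 1, 'i3': 0, 'i4': 0, 'END': 0}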

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x   EST[x]   LST[x]   rank(x)
START      0        0        0
i1         0        0        0
i2         1        2        1
i3         1        1        0
i4         3        3        0
END        4        4        0

==> Priority list = (i1, i3, i4, i2)

[Augmented DDG: START → i1 → {i2, i3} → i4 → END; node weight 1 for each instruction and 0 for START/END; edge weight 1 on the i3 → i4 edge (load result), 0 on all other edges.]

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1. Build the data dependence graph (DDG) for the basic block:
   • Node = target instruction
   • Edge = data dependence (flow/anti/output)

2. Assign weights to nodes and edges in the DDG so as to model the target processor, e.g. for a two-stage pipeline:
   • Node weight = 1 for all nodes
   • Edge weight = 1 for edges with a load/store instruction as the source node; edge weight = 0 for all other edges
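A minimal sketch of step 2's weight assignment for the two-stage pipeline model (Python; the instr-to-kind mapping is the same illustrative convention used in the earlier sketches):

def assign_weights(kinds, ddg_edges):
    """kinds: {instr: 'alu' | 'mem'}; ddg_edges: iterable of (src, dst) pairs."""
    node_w = {name: 1 for name in kinds}                       # node weight = 1 for all nodes
    edge_w = {(j, k): 1 if kinds[j] == "mem" else 0            # 1 if the source is a load/store
              for (j, k) in ddg_edges}
    return node_w, edge_w

node_w, edge_w = assign_weights(
    {"i1": "alu", "i2": "alu", "i3": "mem", "i4": "alu"},
    [("i1", "i2"), ("i1", "i3"), ("i2", "i4"), ("i3", "i4")])
print(edge_w)   # {('i1', 'i2'): 0, ('i1', 'i3'): 0, ('i2', 'i4'): 0, ('i3', 'i4'): 1}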

1120418 coursecpeg421-10FTopic2appt 50

Summary (Con'd)

3. A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG.

4. Goal: find a legal schedule with minimum completion time.

1120418 coursecpeg421-10FTopic2appt 51

Other Heuristics for Ranking

• Number of successors
• Number of total descendants
• Latency
• Last use of a variable
• Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

• Structural hazards

• Data dependence hazards

• Control hazards

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1. Straight-line code (basic block) - local scheduling

2. Acyclic control flow - global scheduling
   • Trace scheduling
   • Hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)

3. Loops - one solution for this case is loop unrolling + scheduling; another solution is software pipelining, or modulo scheduling, i.e. rewriting the loop as a repeating pattern that overlaps instructions from different iterations (see the sketch below)
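To illustrate the repeating pattern, here is a tiny Python sketch; the three-operation loop body, the one-cycle spacing, and the initiation interval of 1 are all illustrative assumptions rather than a real modulo scheduler. It prints which iteration each operation in the steady-state kernel comes from:

BODY = ["load", "add", "store"]     # operations of one loop iteration, assumed 1 cycle apart
II = 1                              # initiation interval: a new iteration starts every II cycles

def kernel(cycle):
    """Operations in flight in a steady-state cycle: operation s belongs to iteration cycle//II - s."""
    return [f"{op}(iter {cycle // II - s})" for s, op in enumerate(BODY)]

for c in range(4, 7):
    print(f"cycle {c}:", ", ".join(kernel(c)))
# cycle 4: load(iter 4), add(iter 3), store(iter 2)
# cycle 5: load(iter 5), add(iter 4), store(iter 3)
# cycle 6: load(iter 6), add(iter 5), store(iter 4)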

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Flow into Backend in Open64

[Backend flow diagram: WHIRL is lowered via WHIRL-to-TOP into CGIR; the code generator phases include hyperblock formation / critical-path reduction, inner loop optimization / software pipelining, extended basic block optimization, control flow optimization, and IGLS / GRA / LRA, ending with code emission to produce the executable. Information from the front end (alias information, structures, etc.) flows into the backend.]

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

[Flowchart: WHIRL is lowered (WHIRL-to-TOP) into CGIR (quad op list); the code generator then runs: processing of inner loops (unrolling, EBO), loop prep and software pipelining, Control Flow Opt I + EBO, Hyperblock Formation and Critical-Path Reduction, Control Flow Opt II + EBO, IGLS pre-pass, GRA, LRA, EBO, IGLS post-pass, Control Flow Opt, and finally Code Emission.]

EBO = Extended basic block optimization (peephole, etc.)
PQS = Predicate Query System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs. Normal Scheduling

[Flowchart: if a loop is a SWP-amenable candidate (Yes), it goes to inner loop processing / software pipelining; on success it proceeds to code emission, while on failure or if not profitable (or No), it falls back to the normal path: IGLS, GRA/LRA, IGLS, then code emission.]

Page 21: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in modern hardware

bullpipeliningbullsuperscalar processingbullVLIWbullmultiprocessing

Of these the first three forms are commonly exploited by todayrsquos compilersrsquo instruction scheduling phase

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 22: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be overlapped It has the same principle as the assembly line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is accomplished by adding more hardware for parallel execution of stages and for dispatching instructions to them

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 23: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch- decode and register fetch- execute on ALU- memory access- write back to register file

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 24: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each instruction is in a different stage but every stage is active

The pipeline is ldquofullrdquo here

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling (Simplified)

Problem statement: given an acyclic weighted data dependence graph G with
• directed edges: precedence
• undirected edges: resource constraints
determine a schedule S such that the length of the schedule is minimized.

[Example DDG: nodes 1-6 with weighted edges d12, d13, d23, d24, d26, d34, d35, d45, d46, d56]

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a "clean" pipeline with k stages.

What is the complexity of an optimal scheduling algorithm under such constraints?

Scheduling for M is still hard:
• n = 2: a polynomial-time algorithm exists [Coffman & Graham]
• n = 3: remains open; conjectured NP-hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches of Instruction Scheduling

• List scheduling
• Trace scheduling
• Software pipelining
• ...

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

• Use information about actual program behavior to select regions for scheduling.

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

• Rewrite the loop as a repeating pattern that overlaps instructions from different iterations.

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

The most common technique for scheduling instructions within a basic block.
The basic idea of list scheduling:
• Sort all instructions according to some priority function, and maintain a list of instructions that are ready to execute (i.e., their data dependence constraints are satisfied).
• Moving cycle-by-cycle through the schedule template:
  • choose instructions from the list and schedule them (provided that machine resources are available)
  • update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Con't)

• Uses a greedy heuristic approach
• Has forward and backward forms
• Is the basis for most algorithms that perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic Solution: Greedy List Scheduling Algorithm

1. Build a priority list L of the instructions in non-decreasing order of some rank function.

2. For each instruction j, initialize pred-count[j] = number of predecessors of j in the DDG.

3. Ready-instructions = { j | pred-count[j] = 0 }

4. While (ready-instructions is non-empty) do:
       j = the first ready instruction according to the order in priority list L;
       output j as the next instruction in the schedule;

(Note: also consider resource constraints beyond a single clean pipeline.)

1120418 coursecpeg421-10FTopic2appt 44

Heuristic Solution: Greedy List Scheduling Algorithm (Con'd)

       Ready-instructions = ready-instructions - {j}
       for each successor k of j in the DDG do
           pred-count[k] = pred-count[k] - 1
           if (pred-count[k] = 0) then
               ready-instructions = ready-instructions + {k}
           end if
       end for
   end while
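Putting the two slides together, here is a compact executable sketch of the greedy list scheduler (my own rendering, not the slides' exact algorithm): it assumes a single clean pipeline that issues at most one instruction per cycle and simply stalls when no ready instruction has its operands available, and it ignores resource-reservation tables.

def list_schedule(priority, edges):
    # priority: instructions in priority-list order; edges: dict (j, k) -> latency weight
    preds = {i: [] for i in priority}
    for (j, k), w in edges.items():
        preds[k].append((j, w))
    pred_count = {i: len(preds[i]) for i in priority}
    ready = [i for i in priority if pred_count[i] == 0]
    start, cycle = {}, 0
    while ready:
        # candidates: ready instructions whose operands have arrived by this cycle
        ok = [i for i in ready if all(start[j] + w <= cycle for j, w in preds[i])]
        if ok:
            j = ok[0]                       # first such instruction in priority order
            ready.remove(j)
            start[j] = cycle
            for (a, k), _ in edges.items():
                if a == j:                  # release successors whose last predecessor was j
                    pred_count[k] -= 1
                    if pred_count[k] == 0:
                        ready.append(k)
            ready.sort(key=priority.index)  # keep the ready list in priority order
        cycle += 1                          # advance the clock (a stall if ok was empty)
    return start

edges = {("i1", "i2"): 1, ("i1", "i3"): 1, ("i2", "i4"): 1, ("i3", "i4"): 3}
print(list_schedule(["i1", "i3", "i4", "i2"], edges))
# {'i1': 0, 'i3': 1, 'i2': 2, 'i4': 4}  -> the 5-cycle schedule S2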

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two-stage pipeline, i.e., m = 1 and k = 2 (here m is the number of pipelines and k is the number of pipeline stages per pipeline):

makespan(greedy) / makespan(OPT) <= 1.5

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

• The result is within a factor of two of the optimal for pipelined machines (Lawler '87).
  (Note: we are considering basic-block scheduling here.)
• Complexity: O(n^2), where n is the number of nodes in the DDG.
• In practice, the time is dominated by DDG construction, which is itself O(n^2).

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1. Compute EST (Earliest Starting Time) for each node in the augmented DDG as follows:
   EST[START] = 0
   EST[y] = MAX { EST[x] + node-weight(x) + edge-weight(x, y) | there exists an edge from x to y }
2. Set CPL = EST[END], the critical path length of the augmented DDG.
3. Similarly, compute LST (Latest Starting Time) for all nodes.
4. Set rank(i) = LST[i] - EST[i] for each instruction i.
   (All instructions on a critical path will have zero rank.)

NOTE: there are other heuristics.

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x   EST[x]   LST[x]   rank(x)
START      0        0        0
i1         0        0        0
i2         1        2        1
i3         1        1        0
i4         3        3        0
END        4        4        0

==> Priority list = (i1, i3, i4, i2)

[Augmented DDG: START -> i1; i1 -> i2 and i1 -> i3; i2 -> i4 and i3 -> i4; i4 -> END. Node weights are 1 for i1-i4 and 0 for START/END; the edge i3 -> i4 (out of the load) has weight 1, all other edges have weight 0.]
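The EST/LST/rank computation can be written as a forward and a backward pass over a topological order of the augmented DDG. The sketch below is added for illustration (not part of the slides) and reproduces the table above using the two-stage-pipeline weights from the summary slide: node weight 1 for instructions, 0 for START/END, edge weight 1 out of the load and 0 elsewhere.

def critical_path_rank(order, node_w, edge_w):
    # order: nodes in topological order (START first, END last)
    # node_w: node -> node weight; edge_w: (x, y) -> edge weight
    succs = {n: [] for n in order}
    preds = {n: [] for n in order}
    for (x, y) in edge_w:
        succs[x].append(y)
        preds[y].append(x)
    est = {order[0]: 0}
    for y in order[1:]:                    # forward pass: earliest start times
        est[y] = max((est[x] + node_w[x] + edge_w[(x, y)] for x in preds[y]), default=0)
    cpl = est[order[-1]]                   # critical path length = EST[END]
    lst = {order[-1]: cpl}
    for x in reversed(order[:-1]):         # backward pass: latest start times
        lst[x] = min((lst[y] - node_w[x] - edge_w[(x, y)] for y in succs[x]), default=cpl)
    rank = {n: lst[n] - est[n] for n in order}
    return est, lst, rank

node_w = {"START": 0, "i1": 1, "i2": 1, "i3": 1, "i4": 1, "END": 0}
edge_w = {("START", "i1"): 0, ("i1", "i2"): 0, ("i1", "i3"): 0,
          ("i2", "i4"): 0, ("i3", "i4"): 1, ("i4", "END"): 0}
est, lst, rank = critical_path_rank(["START", "i1", "i2", "i3", "i4", "END"], node_w, edge_w)
print(rank)   # {'START': 0, 'i1': 0, 'i2': 1, 'i3': 0, 'i4': 0, 'END': 0}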

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction scheduling for a basic block:
1. Build the data dependence graph (DDG) for the basic block:
   • node = target instruction
   • edge = data dependence (flow/anti/output)
2. Assign weights to nodes and edges in the DDG so as to model the target processor, e.g., for a two-stage pipeline:
   • node weight = 1 for all nodes
   • edge weight = 1 for edges with a load/store instruction as source node; edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary (Con'd)

3. A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG.

4. Goal: find a legal schedule with minimum completion time.

1120418 coursecpeg421-10FTopic2appt 51

Other Heuristics for Ranking

• Number of successors
• Number of total descendants
• Latency
• Last use of a variable
• Others

Note: these heuristics help break ties, but none dominates the others.

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

• Structural hazards

• Data-dependence hazards

• Control hazards

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1. Straight-line code (basic block) – local scheduling

2. Acyclic control flow – global scheduling:
   • trace scheduling
   • hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)

3. Loops – one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations.

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Flow into the Backend in Open64

[Open64 backend diagram: WHIRL is lowered (WHIRL-to-TOP) into CGIR; the code generator phases include Hyperblock Formation / Critical-Path Reduction, Inner Loop Optimization / Software Pipelining, Extended Basic Block Optimization, Control Flow Optimization, IGLS, GRA/LRA, and Code Emission, producing the executable. Information from the front end (alias, structure, etc.) flows into the backend.]

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

[Flowchart: WHIRL -> WHIRL-to-TOP lowering -> CGIR quad op list -> Control Flow Opt I / EBO -> Hyperblock Formation / Critical-Path Reduction -> Process Inner Loops (unrolling, EBO) -> Loop prep / software pipelining -> IGLS pre-pass -> GRA, LRA, EBO -> IGLS post-pass -> Control Flow Opt II / EBO -> Code Emission.
Legend: EBO = extended basic block optimization (peephole, etc.); PQS = Predicate Query System.]

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs. Normal Scheduling

[Flowchart: for an SWP-amenable loop candidate ("Yes"), inner loop processing / software pipelining is attempted; on success the result goes to code emission, while on failure or when not profitable the loop falls back to the normal path (IGLS, GRA/LRA, IGLS) before code emission; non-candidates ("No") take the normal path directly.]

Page 25: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Examplei1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 0(r1)i4 add r5 r3 r4

Consider two possible instruction schedules (permutations)`

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 26: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in the same pipeline stage at the same time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 27: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints If instruction A computes a value that is read by instruction B then B canrsquot execute before A is completedResource hazards Finiteness of hardware function units means limited parallelism

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 28: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources bull finite set of FUs with instruction type and width and latency constraints bull cache hierarchy also has many constraints1048707 Data Dependences bull canrsquot consume a result before it is produced bull ambiguous dependences create many challenges1048707 Control Dependences bull impractical to schedule for all possible paths bull choosing an ldquoexpectedrdquo path may be difficult bull recovery costs can be non-trivial if you are wrong

1120418 coursecpeg421-10FTopic2appt 29

Legality Constraint for Instruction Scheduling

Question when must we preserve the order of two instructions i and j

Answer when there is a dependence from i to j

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

[Figure: code generator flowchart. WHIRL enters WHIRL-to-TOP lowering, producing CGIR (quad op list); this is followed by Control Flow Opt I / EBO, hyperblock formation and critical-path reduction, Control Flow Opt II / EBO, processing of inner loops (unrolling, EBO), loop preparation and software pipelining, the IGLS pre-pass, GRA / LRA / EBO, the IGLS post-pass, Control Flow Opt, and finally code emission. EBO = extended basic block optimization (peephole, etc.); PQS = Predicate Query System.]

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs. Normal Scheduling

[Figure: decision flowchart. If a loop is an SWP-amenable candidate, it goes through inner loop processing / software pipelining; on success it proceeds directly to code emission, while on failure (or if not profitable, or if the loop is not a candidate) it falls back to the normal path of IGLS, GRA/LRA, IGLS, and then code emission.]

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 30: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to nodes and edges in the DDG to model the pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose value is the resource-reservation table associated with the operation type of this node

bull Each edge e from node j to node k is labeled with a weight (latency or delay) de indicting that the destination node j must be issued no earlier than de cycles after the source node k is issued

Dragon book 722

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 31: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 31

Example of a Weighted Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4i2

i3

i1 i4

1 1

31

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 32: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a pipeline consists of

A permutation f on 1hellipm such that f(j) identifies the new position of instruction j in the basic block For each DDG edge form j to k the schedule must satisfy f(j) lt= f(k)

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 33: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-timeAn instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction jbull No two instructions have the same start-time valuebull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

1120418 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 34: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 34

Legal Schedules for Pipeline (Con't)

We also define:

make_span(S) = completion time of schedule S
             = MAX { completion-time(j) | 1 <= j <= m }

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Assume:
• Register instruction: 1 cycle
• Memory instruction: 3 cycles

i1: add r1, r1, r2
i2: add r3, r3, r1
i3: lw  r4, (r1)
i4: add r5, r3, r4

Schedule S1: i1, i2, i3, i4 with start times 0, 1, 2, 5 (2 idle cycles, completion time = 6 cycles)
Schedule S2: i1, i3, i2, i4 with start times 0, 1, 2, 4 (1 idle cycle, completion time = 5 cycles)
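To make the timing concrete, here is a small Python sketch (my own, not from the slides) that replays both schedules under the stated latency model, assuming a single pipeline that issues one instruction per cycle and stalls until operands are ready; the simulate helper and the instruction names are illustrative only.

# Latency model from the example: register ops take 1 cycle, the load (i3) takes 3.
LATENCY = {"i1": 1, "i2": 1, "i3": 3, "i4": 1}

# Flow dependences among i1..i4: each instruction lists the producers it waits for.
DEPS = {"i1": [], "i2": ["i1"], "i3": ["i1"], "i4": ["i2", "i3"]}

def simulate(order):
    """Issue one instruction per cycle in the given order, stalling an
    instruction until all of its producers have completed."""
    finish = {}
    next_issue = 0
    for instr in order:
        ready = max((finish[p] for p in DEPS[instr]), default=0)
        start = max(next_issue, ready)       # stall if an operand is not ready yet
        finish[instr] = start + LATENCY[instr]
        next_issue = start + 1               # next issue slot
    return max(finish.values())              # completion time of the schedule

print(simulate(["i1", "i2", "i3", "i4"]))    # Schedule S1 -> 6 cycles
print(simulate(["i1", "i3", "i2", "i4"]))    # Schedule S2 -> 5 cycles

Moving i2 after i3 lets the independent add overlap with the load latency, which is where the one-cycle saving of S2 comes from.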

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling (Simplified)

Problem Statement:
Given an acyclic weighted data dependence graph G with
• directed edges: precedence
• undirected edges: resource constraints,
determine a schedule S such that the length of the schedule is minimized.

[Figure: an example DDG with nodes 1-6 and weighted edges d12, d13, d23, d24, d26, d34, d35, d45, d46, d56]

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units, or a "clean" pipeline with k stages.
What is the complexity of an optimal scheduling algorithm under such constraints?

Scheduling for M is still hard:
• n = 2: a polynomial-time algorithm exists [Coffman & Graham]
• n = 3: remains open; conjectured NP-hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches of Instruction Scheduling

• List scheduling
• Trace scheduling
• Software pipelining
• ...

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks.

The basic idea of trace scheduling:
• Use information about actual program behavior (e.g., profile information) to select regions (traces) for scheduling.

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations.

The basic idea of software pipelining:
• Rewrite the loop as a repeating pattern that overlaps instructions from different iterations (see the sketch below).
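As a rough illustration of this idea (my own sketch, not from the slides), the fragment below overlaps a hypothetical three-stage loop body (load, compute, store) with an initiation interval of one cycle; the stage names and single-cycle stage latencies are assumptions made for the example.

# Schematic software pipelining: split the loop body into three one-cycle stages
# and start a new iteration every cycle (initiation interval II = 1).
STAGES = ["load", "compute", "store"]
N = 5                                   # number of iterations to schedule

schedule = {}                           # cycle -> operations issued in that cycle
for i in range(N):
    for s, stage in enumerate(STAGES):
        schedule.setdefault(i + s, []).append(f"{stage}[{i}]")

for cycle in sorted(schedule):
    print(f"cycle {cycle}: " + ", ".join(schedule[cycle]))
# Cycles 2..4 print the repeating kernel (store of iteration i-2, compute of
# iteration i-1, load of iteration i); the earlier and later cycles form the
# prologue and epilogue of the software-pipelined loop.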

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

One of the most common techniques for scheduling instructions within a basic block.

The basic idea of list scheduling:
• All instructions are sorted according to some priority function; also maintain a list of instructions that are ready to execute (i.e., whose data dependence constraints are satisfied).
• Moving cycle by cycle through the schedule template:
  • choose instructions from the ready list and schedule them (provided that machine resources are available);
  • update the list for the next cycle.

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Con't)

• Uses a greedy heuristic approach
• Has forward and backward forms
• Is the basis for most algorithms that perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic Solution: Greedy List Scheduling Algorithm

1. Build a priority list L of the instructions in non-decreasing order of some rank function.
2. For each instruction j, initialize pred-count[j] = number of predecessors of j in the DDG.
3. ready-instructions = { j | pred-count[j] = 0 }
4. While (ready-instructions is non-empty) do
     j = first ready instruction according to the order in priority list L;
     output j as the next instruction in the schedule;

(Consider resource constraints beyond a single clean pipeline.)

1120418 coursecpeg421-10FTopic2appt 44

Heuristic Solution: Greedy List Scheduling Algorithm (Con'd)

     ready-instructions = ready-instructions - { j };
     for each successor k of j in the DDG do
       pred-count[k] = pred-count[k] - 1;
       if (pred-count[k] = 0) then
         ready-instructions = ready-instructions + { k };
       end if
     end for
   end while
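A compact Python rendering of the pseudocode above is sketched below (my own code, not from the slides); it only models the ready-list bookkeeping on a single clean pipeline, the rank function is supplied by the caller, and the dictionary encoding of the DDG is an assumption made for the example.

def greedy_list_schedule(ddg, rank):
    """ddg: dict mapping each instruction to the list of its successors in the DDG.
    rank: dict mapping each instruction to its priority (lower = scheduled earlier).
    Returns one legal ordering; resource constraints beyond a single clean
    pipeline are not modeled here."""
    # Step 1: priority list L in non-decreasing order of rank.
    priority = sorted(ddg, key=lambda j: rank[j])

    # Step 2: pred-count[j] = number of predecessors of j in the DDG.
    pred_count = {j: 0 for j in ddg}
    for j in ddg:
        for k in ddg[j]:
            pred_count[k] += 1

    # Step 3: initially ready = instructions with no predecessors.
    ready = {j for j in ddg if pred_count[j] == 0}

    # Step 4: repeatedly emit the first ready instruction in priority order.
    schedule = []
    while ready:
        j = next(i for i in priority if i in ready)
        schedule.append(j)
        ready.remove(j)
        for k in ddg[j]:
            pred_count[k] -= 1
            if pred_count[k] == 0:
                ready.add(k)
    return schedule

# With a DDG and ranks matching the rank-computation example further below
# (rank: i1 = 0, i2 = 1, i3 = 0, i4 = 0), this yields the order i1, i3, i2, i4.
print(greedy_list_schedule(
    {"i1": ["i2", "i3"], "i2": ["i4"], "i3": ["i4"], "i4": []},
    {"i1": 0, "i2": 1, "i3": 0, "i4": 0}))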

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two-stage pipeline, i.e., m = 1 and k = 2 (here m is the number of pipelines and k is the number of pipeline stages per pipeline):

makespan(greedy) / makespan(OPT) <= 1.5

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

• The result is within a factor of two of the optimal for pipelined machines (Lawler '87). Note: we are considering basic-block scheduling here.
• Complexity: O(n^2), where n is the number of nodes in the DDG.
• In practice, the total cost is dominated by building the DDG, which is itself O(n^2).

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function Based on Critical Paths

1. Compute EST (Earliest Starting Time) for each node in the augmented DDG as follows:
   EST[START] = 0
   EST[y] = MAX { EST[x] + node-weight(x) + edge-weight(x, y) | there exists an edge from x to y }
2. Set CPL = EST[END], the critical-path length of the augmented DDG.
3. Similarly, compute LST (Latest Starting Time) for all nodes.
4. Set rank(i) = LST[i] - EST[i] for each instruction i.
   (All instructions on a critical path will have zero rank.)

NOTE: there are other heuristics.

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x   EST[x]   LST[x]   rank(x)
START      0        0        0
i1         0        0        0
i2         1        2        1
i3         1        1        0
i4         3        3        0
END        4        4        0

==> Priority list = (i1, i3, i4, i2)

[Figure: the augmented DDG START -> i1; i1 -> i2, i3; i2, i3 -> i4; i4 -> END, with node weights 1 (0 for START and END) and edge weight 1 on the load edge i3 -> i4, 0 on all other edges]
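The following Python sketch (mine, not from the slides) reproduces the EST/LST/rank table above for the four-instruction example, using the two-stage-pipeline weight model stated in the summary below (node weight 1, edge weight 1 when the source is the load); the explicit edge list is my reconstruction of the figure.

# Augmented DDG for the example; edge weights follow the two-stage pipeline model.
NODE_W = {"START": 0, "i1": 1, "i2": 1, "i3": 1, "i4": 1, "END": 0}
EDGE_W = {("START", "i1"): 0, ("i1", "i2"): 0, ("i1", "i3"): 0,
          ("i2", "i4"): 0, ("i3", "i4"): 1,      # i3 is the load, so its outgoing edge has weight 1
          ("i4", "END"): 0}
ORDER = ["START", "i1", "i2", "i3", "i4", "END"]  # a topological order of the DDG

# Forward pass: earliest starting times.
est = {}
for y in ORDER:
    preds = [(x, w) for (x, y2), w in EDGE_W.items() if y2 == y]
    est[y] = max([est[x] + NODE_W[x] + w for x, w in preds], default=0)

cpl = est["END"]                                  # critical path length = 4

# Backward pass: latest starting times, anchored at the critical path length.
lst = {}
for x in reversed(ORDER):
    succs = [(y, w) for (x2, y), w in EDGE_W.items() if x2 == x]
    lst[x] = min([lst[y] - NODE_W[x] - w for y, w in succs], default=cpl)

for n in ORDER:
    print(n, est[n], lst[n], lst[n] - est[n])     # reproduces the EST / LST / rank table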

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block
1. Build the data dependence graph (DDG) for the basic block:
   • Node = target instruction
   • Edge = data dependence (flow / anti / output)
2. Assign weights to nodes and edges in the DDG so as to model the target processor, e.g., for a two-stage pipeline:
   • Node weight = 1 for all nodes
   • Edge weight = 1 for edges with a load/store instruction as source node; edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary (Con'd)

3. A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG.
4. Goal: find a legal schedule with minimum completion time.
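As a sketch of steps 1 and 2 (my own illustration, not code from the slides), the fragment below derives the flow-dependence edges and their weights for the four-instruction example; the def/use encoding of each instruction is an assumption, and anti/output dependences and register redefinitions are ignored for brevity.

# Each instruction: (destination register, source registers, is-a-memory-op flag).
INSTRS = {
    "i1": ("r1", ["r1", "r2"], False),   # add r1, r1, r2
    "i2": ("r3", ["r3", "r1"], False),   # add r3, r3, r1
    "i3": ("r4", ["r1"],       True),    # lw  r4, (r1)
    "i4": ("r5", ["r3", "r4"], False),   # add r5, r3, r4
}
ORDER = ["i1", "i2", "i3", "i4"]

node_weight = {i: 1 for i in ORDER}      # node weight = 1 for all nodes
edges = {}                               # (producer, consumer) -> edge weight

# Flow dependence: a later instruction reads a register written by an earlier one.
for idx, j in enumerate(ORDER):
    dst, _, is_mem = INSTRS[j]
    for k in ORDER[idx + 1:]:
        if dst in INSTRS[k][1]:
            edges[(j, k)] = 1 if is_mem else 0   # weight 1 if the source is a load/store

print(edges)  # {('i1', 'i2'): 0, ('i1', 'i3'): 0, ('i2', 'i4'): 0, ('i3', 'i4'): 1}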

1120418 coursecpeg421-10FTopic2appt 51

Other Heuristics for Ranking

• Number of successors
• Number of total descendants
• Latency
• Last use of a variable
• Others

Note: these heuristics help break ties, but none dominates the others.

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

• Structural hazards
• Data (dependence) hazards
• Control hazards

1120418 coursecpeg421-10FTopic2appt 53

Local vs. Global Scheduling

1. Straight-line code (basic block): local scheduling
2. Acyclic control flow: global scheduling
   • Trace scheduling
   • Hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)
3. Loops: one solution is loop unrolling + scheduling (see the unrolling sketch below); another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations.
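For item 3, here is a minimal sketch of the unroll-and-schedule option (my own example on a hypothetical loop, not from the slides): unrolling by 2 gives the local scheduler a larger straight-line body whose independent operations it can interleave.

# Original loop: each iteration is a tiny basic block with little room to reorder.
def saxpy(a, x, y, n):
    for i in range(n):
        y[i] = a * x[i] + y[i]

# Unrolled by a factor of 2: the larger body exposes two independent updates per
# iteration that a local scheduler can interleave; a cleanup step handles odd n.
def saxpy_unrolled(a, x, y, n):
    i = 0
    while i + 1 < n:
        y[i]     = a * x[i]     + y[i]
        y[i + 1] = a * x[i + 1] + y[i + 1]
        i += 2
    if i < n:
        y[i] = a * x[i] + y[i]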

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Flow into the Backend in Open64

[Figure: information flow into the Open64 back end. WHIRL is lowered via WHIRL-to-TOP into CGIR, which then passes through hyperblock formation / critical-path reduction, inner loop optimization / software pipelining, extended basic block optimization, control flow optimization, IGLS, GRA/LRA, and code emission to produce the executable. Information from the front end (alias, structure, etc.) is carried along into the back end.]

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

[Figure: WHIRL -> WHIRL-to-TOP lowering -> CGIR (quad op list) -> Control Flow Opt I / EBO -> Hyperblock Formation, Critical-Path Reduction -> Control Flow Opt II / EBO -> Process Inner Loops (unrolling, EBO) -> Loop prep, software pipelining -> IGLS pre-pass -> GRA, LRA, EBO -> IGLS post-pass -> Control Flow Opt -> Code Emission]

EBO = extended basic block optimization (peephole, etc.); PQS = Predicate Query System.

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs. Normal Scheduling

[Figure: decision flow for an inner loop. If the loop is a SWP-amenable candidate, it goes through inner loop processing / software pipelining; on success it proceeds directly to code emission, while on failure or when SWP is not profitable it falls back to the normal path (IGLS, GRA/LRA, IGLS) before code emission. Loops that are not SWP candidates take the normal path directly.]

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 35: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 35

Example of Legal Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2i2 add r3 r3 r1 i3 lw r4 (r1)i4 add r5 r3 r4

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 36: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 36

Instruction Scheduling(Simplified)

Given an acyclic weighted data dependence graph G withbull Directed edges precedencebull Undirected edges resource

constraints

Determine a schedule S such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35d34

d45

d56

d46

d26

d24

Problem Statement

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 37: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 37

Simplify Resource Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 38: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 38

General Approaches ofInstruction Scheduling

List SchedulingTrace schedulingSoftware pipelining helliphellip

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 39: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 40: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function Based on Critical Paths

1. Compute EST (Earliest Starting Time) for each node in the augmented DDG as follows:
      EST[START] = 0
      EST[y] = MAX { EST[x] + node-weight(x) + edge-weight(x,y) | there exists an edge from x to y }
2. Set CPL = EST[END], the critical path length of the augmented DDG.
3. Similarly, compute LST (Latest Starting Time) for all nodes.
4. Set rank(i) = LST[i] - EST[i] for each instruction i
   (all instructions on a critical path will have zero rank).

NOTE: there are other heuristics.


Example of Rank Computation

Node x   EST[x]   LST[x]   rank(x)
START    0        0        0
i1       0        0        0
i2       1        2        1
i3       1        1        0
i4       3        3        0
END      4        4        0

==> Priority list = (i1, i3, i4, i2)

[Figure: the augmented DDG for this example, with START and END pseudo nodes plus instructions i1-i4 and the node and edge weights that produce the EST/LST values above.]
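The numbers in the table can be reproduced with a short forward/backward pass over the weighted DDG. The sketch below uses an assumed reconstruction of the example graph (node weight 1 for i1-i4, 0 for the START/END pseudo nodes, and edge weight 1 only on the i3 -> i4 edge) that is consistent with the EST/LST values shown; the exact edge set is an assumption, not read off the slide.

NODES = ["START", "i1", "i2", "i3", "i4", "END"]          # topological order
NODE_W = {"START": 0, "i1": 1, "i2": 1, "i3": 1, "i4": 1, "END": 0}
EDGES = {("START", "i1"): 0, ("i1", "i2"): 0, ("i1", "i3"): 0,
         ("i2", "i4"): 0, ("i3", "i4"): 1, ("i4", "END"): 0}

def critical_path_rank(nodes, node_w, edges):
    preds = {n: [x for (x, y) in edges if y == n] for n in nodes}
    succs = {n: [y for (x, y) in edges if x == n] for n in nodes}

    est = {n: 0 for n in nodes}                   # forward pass: EST
    for y in nodes:
        for x in preds[y]:
            est[y] = max(est[y], est[x] + node_w[x] + edges[(x, y)])

    cpl = est["END"]                              # critical path length
    lst = {n: cpl for n in nodes}                 # backward pass: LST
    for x in reversed(nodes):
        for y in succs[x]:
            lst[x] = min(lst[x], lst[y] - edges[(x, y)] - node_w[x])

    return {n: lst[n] - est[n] for n in nodes}    # rank = LST - EST

# critical_path_rank(NODES, NODE_W, EDGES)
# => {'START': 0, 'i1': 0, 'i2': 1, 'i3': 0, 'i4': 0, 'END': 0}
# Sorting the real instructions by rank (ties broken by original order)
# gives the priority list (i1, i3, i4, i2).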


Summary

Instruction scheduling for a basic block (a small sketch of steps 1-2 follows this summary):

1. Build the data dependence graph (DDG) for the basic block:
   • Node = target instruction
   • Edge = data dependence (flow/anti/output)

2. Assign weights to nodes and edges in the DDG so as to model the target processor, e.g., for a two-stage pipeline:
   • Node weight = 1 for all nodes
   • Edge weight = 1 for edges with a load/store instruction as the source node; edge weight = 0 for all other edges


Summary (Con'd)

3. A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG.

4. Goal: find a legal schedule with minimum completion time.
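As a toy illustration of steps 1 and 2 above, the sketch below builds a weighted DDG for a hypothetical three-instruction block under the two-stage-pipeline weight model from the summary; the instruction encoding and the simple register/location dependence test are assumptions for the example, not the course's implementation.

INSNS = [                         # (id, opcode, destination, sources)
    ("i1", "load",  "r1", ["a"]),
    ("i2", "add",   "r2", ["r1", "r3"]),
    ("i3", "store", "a",  ["r2"]),
]

def build_weighted_ddg(insns):
    node_w = {ins[0]: 1 for ins in insns}         # node weight = 1 everywhere
    edges = {}
    for p in range(len(insns)):
        for q in range(p + 1, len(insns)):
            ip, iq = insns[p], insns[q]
            flow   = ip[2] in iq[3]               # ip writes what iq reads
            anti   = iq[2] in ip[3]               # iq writes what ip reads
            output = ip[2] == iq[2]               # both write the same location
            if flow or anti or output:
                # edge weight = 1 when the source is a load/store, else 0
                edges[(ip[0], iq[0])] = 1 if ip[1] in ("load", "store") else 0
    return node_w, edges

# build_weighted_ddg(INSNS)
# => ({'i1': 1, 'i2': 1, 'i3': 1},
#     {('i1', 'i2'): 1, ('i1', 'i3'): 1, ('i2', 'i3'): 0})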


Other Heuristics for Ranking

• Number of successors
• Number of total descendants
• Latency
• Last use of a variable
• Others

Note: these heuristics help break ties, but none dominates the others.


Hazards Preventing Pipelining

• Structural hazards
• Data-dependent hazards
• Control hazards


Local vs Global Scheduling

1. Straight-line code (basic block) - local scheduling.

2. Acyclic control flow - global scheduling:
   • Trace scheduling
   • Hyperblock/superblock scheduling
   • IGLS (Integrated Global and Local Scheduling)

3. Loops - one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations.


Smooth Info Flow into the Backend in Open64

[Figure: information flow into the Open64 back end. WHIRL is lowered (WHIRL-to-TOP) into CGIR, which passes through hyperblock formation / critical-path reduction, inner loop optimization / software pipelining, extended basic block optimization, control flow optimization, and IGLS / GRA / LRA, ending with code emission into an executable; information from the front end (alias, structure, etc.) flows into these back-end phases.]


Flowchart of Code Generator

[Figure: code generator flowchart. The boxes include WHIRL, WHIRL-to-TOP lowering, CGIR quad op list, Control Flow Opt I / EBO, Hyperblock Formation and Critical-Path Reduction, Control Flow Opt II / EBO, processing of inner loops (unrolling and EBO, loop prep, software pipelining), IGLS pre-pass, GRA / LRA / EBO, IGLS post-pass, Control Flow Opt, and Code Emission. EBO = extended basic block optimization (peephole, etc.); PQS = Predicate Query System.]


Software Pipelining vs. Normal Scheduling

[Figure: decision flow. A SWP-amenable loop candidate (Yes) goes through inner loop processing / software pipelining; on success it proceeds to code emission, while on failure or when not profitable it falls back to the normal path. Non-candidates (No) take the normal path: IGLS, GRA/LRA, IGLS, then code emission.]

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 41: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic blockThe Basic Idea of list scheduling 1048707 All instructions are sorted according to some priority function Also Maintain a list of instructions that are ready to execute (ie data dependence constraints are satisfied) 1048707 Moving cycle-by-cycle through the schedule template bull choose instructions from the list amp schedule them (provided that machine resources are available) bull update the list for the next cycle

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 42: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach 1048707 Has forward and backward forms 1048707 Is the basis for most algorithms that

perform scheduling over regions larger than a single block

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 43: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 43

Heuristic SolutionGreedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0 4 While (ready-instructions is non-empty) do j = first

ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline Consider resource constraints beyond a single clean pipeline

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 44: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic SolutionGreedy List Scheduling

Algorithm

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 45: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 45

Special Performance Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt (here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 46: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 46

Properties of List Scheduling

bull The result is within a factor of two from the optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling hereNote we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the number of nodes in the DDG

bull In practice it is dominated by DDG building which itself is also O(n^2)

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 47: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0

ESTy] =MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )2 Set CPL = EST[END] the critical path length of the augmented

DDG3 Similarly compute LST (Latest Starting Time of All nodes)4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heuresticsNOTE there are other heurestics

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 48: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)Start 0 0 0i1 0 0 0i2 1 2 1i3 1 1 0i4 3 3 0END 4 4 0

==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

11

1

END0

START

0

0

01 01

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global schedulingbull Trace schedulingbull Hyperblocksuperblock schedulingbull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite the loop as a repeating pattern that overlaps instructions from different iterations

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64

Smooth Info Smooth Info flowflow into Backend into Backend

in Open64in Open64 WHIRL

IGLSGRALRA

WHIRL-to-TOP

CGIR

Hyperblock FormationCritical Path Reduction

Inner Loop OptSoftware Pipelining

Extendedbasic blockoptimization

ControlFlowoptimization

Code Emission

Executable

Info

rmat

ion

fro

m F

ron

t en

d(a

lias

str

uct

ure

etc

)

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-passGRA LRA EBOIGLS post-passControl Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt IEBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt IIEBO

EBOExtendedbasic blockoptimizationpeepholeetc

PQSPredicateQuery System

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processingsoftware pipelining

Code Emission

IGLS

GRALRA

IGLS

NoYes

Failurenot profitable

Success

  • Topic 2a Basic Back-End Optimization
  • ABET Outcome
  • Slide 3
  • Slide 4
  • Basic Concept and Motivation
  • Data Dependencies
  • Types of Data Dependencies
  • Slide 8
  • Data Dependence Graph (DDG)
  • Reordering Transformations using DDG
  • Reordering Transformations
  • Reordering Transformations (Conrsquot)
  • Data Dependence Graph
  • Slide 14
  • Should we consider input dependence
  • Applications of Data Dependence Graph
  • Data Dependence in Loops
  • Instruction Scheduling
  • Instruction scheduling (conrsquot)
  • Instruction scheduling A Simple Example
  • Hardware Parallelism
  • Pipelining amp Superscalar Processing
  • A Classic Five-Stage Pipeline
  • Pipeline Illustration
  • Parallelism in a pipeline
  • Superscalar Illustration
  • Parallelism Constraints
  • Scheduling Complications
  • Legality Constraint for Instruction Scheduling
  • Construct DDG with Weights
  • Example of a Weighted Data Dependence Graph
  • Legal Schedules for Pipeline
  • Legal Schedules for Pipeline (Conrsquot)
  • Slide 34
  • Example of Legal Schedules
  • Instruction Scheduling (Simplified)
  • Simplify Resource Constraints
  • General Approaches of Instruction Scheduling
  • Trace Scheduling
  • Software Pipelining
  • List Scheduling
  • List Scheduling (Conrsquot)
  • Heuristic Solution Greedy List Scheduling Algorithm
  • Slide 44
  • Special Performance Bounds
  • Properties of List Scheduling
  • A Heuristic Rank Function Based on Critical paths
  • Example of Rank Computation
  • Summary
  • Slide 50
  • Other Heurestics for Ranking
  • Hazards Preventing Pipelining
  • Local vs Global Scheduling
  • Slide 54
  • Slide 55
  • Software Pipelining vs Normal Scheduling
Page 49: 2015/6/11\course\cpeg421-10F\Topic2a.ppt1 Topic 2a Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

1120418 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block1 Build the data dependence graph (DDG) for the basic

blockbull Node = target instructionbull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipelinebull Node weight = 1 for all nodesbull Edge weight = 1 for edges with loadstore instruction

as source node edge weight = 0 for all other edges

1120418 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must obey all ordering and timing constraints of the weighted DDG

4 Goal find a legal schedule with minimum completion time

Conrsquod

1120418 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors Number of total decendents Latency Last use of a variable Others

Note these heuristics help break ties but none dominates the others

1120418 coursecpeg421-10FTopic2appt 52

Hazards Preventing Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard
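As a small illustration of the structural case only (the unit names and cycle granularity are made up, not tied to any particular machine): a reservation table records which functional unit is busy in which cycle, and the scheduler refuses to place two operations on the same unit in the same cycle.

from collections import defaultdict

class ReservationTable:
    """Toy per-cycle unit reservations; a structural hazard would be two ops
    claiming the same unit in the same cycle."""
    def __init__(self):
        self.busy = defaultdict(set)             # cycle -> set of busy units
    def can_issue(self, unit, cycle):
        return unit not in self.busy[cycle]
    def issue(self, unit, cycle):
        assert self.can_issue(unit, cycle), "structural hazard"
        self.busy[cycle].add(unit)

rt = ReservationTable()
rt.issue("MEM", 0)                               # a load occupies the memory port
print(rt.can_issue("MEM", 0))                    # False: a second load must wait
print(rt.can_issue("ALU", 0))                    # True: a different unit is free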

1120418 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1. Straight-line code (basic block) – local scheduling

2. Acyclic control flow – global scheduling:
• Trace scheduling
• Hyperblock/superblock scheduling
• IGLS (Integrated Global and Local Scheduling)

3. Loops – one solution is loop unrolling + scheduling; another is software pipelining (modulo scheduling), i.e., rewriting the loop as a repeating pattern that overlaps instructions from different iterations (see the sketch after this list).
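To see what that overlap looks like, here is a toy illustration (not a real modulo scheduler): assume a made-up loop body with three one-cycle stages (LOAD, ADD, STORE), each on its own unit, and initiation interval II = 1, so stage s of iteration i issues at cycle i + s.

def software_pipeline(n_iters, stages=("LOAD", "ADD", "STORE")):
    """schedule[cycle] lists the ops issued that cycle.  Consecutive iterations
    overlap, giving a prologue, a steady-state kernel in which every stage is
    busy, and an epilogue."""
    depth = len(stages)
    schedule = []
    for cycle in range(n_iters + depth - 1):
        ops = [f"{stage}(i={cycle - s})"
               for s, stage in enumerate(stages)
               if 0 <= cycle - s < n_iters]
        schedule.append(ops)
    return schedule

for t, ops in enumerate(software_pipeline(5)):
    print(f"cycle {t}: {', '.join(ops)}")
# Cycles 2-4 form the kernel: LOAD(i), ADD(i-1), STORE(i-2) issue together
# every cycle; cycles 0-1 are the prologue and cycles 5-6 the epilogue.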

1120418 coursecpeg421-10FTopic2appt 54

Smooth Info Flow into Backend in Open64

(Diagram.) The WHIRL IR and the information collected by the front end (alias information, structure information, etc.) flow together into the back end, which carries them through its phases (WHIRL-to-TOP lowering, CGIR, hyperblock formation / critical path reduction, inner loop opt / software pipelining, extended basic block optimization, control flow optimization, IGLS, GRA/LRA) down to Code Emission and the final Executable.

1120418 coursecpeg421-10FTopic2appt 55

Flowchart of Code Generator

(Diagram.) WHIRL enters the code generator and is lowered by WHIRL-to-TOP lowering into CGIR (a quad op list); the CGIR then passes through Control Flow Opt I / EBO, Hyperblock Formation / Critical-Path Reduction, inner-loop processing (unrolling, EBO), loop prep / software pipelining, Control Flow Opt II / EBO, IGLS pre-pass, GRA / LRA / EBO, IGLS post-pass, and Control Flow Opt, ending with Code Emission.

EBO = extended basic block optimization (peephole, etc.); PQS = Predicate Query System.

1120418 coursecpeg421-10FTopic2appt 56

Software Pipelining vs. Normal Scheduling

(Diagram.) For each inner loop the code generator asks: is this a SWP-amenable loop candidate?

• Yes: the loop goes to inner loop processing / software pipelining; on success it proceeds directly to Code Emission, and on failure (or when SWP is not profitable) it falls back to the normal path.
• No (or fallback): IGLS pre-pass, GRA / LRA, IGLS post-pass, then Code Emission.

A sketch of this decision logic follows.
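A minimal sketch of that decision, with all names (SWPResult, try_software_pipeline, run_igls, run_gra_lra, emit_code, the loop dictionary) invented for illustration rather than taken from the actual Open64 interfaces:

from dataclasses import dataclass

@dataclass
class SWPResult:
    success: bool
    profitable: bool
    kernel: str = ""

def try_software_pipeline(loop):
    # Stand-in for the modulo scheduler: pretend SWP works only on counted
    # loops without calls.
    ok = loop.get("counted", False) and not loop.get("has_call", False)
    return SWPResult(success=ok, profitable=ok, kernel="swp-kernel")

def run_igls(loop, phase):  print(f"IGLS {phase}-pass on {loop['name']}")
def run_gra_lra(loop):      print(f"GRA/LRA on {loop['name']}")
def emit_code(what):        print(f"emit {what}")

def schedule_inner_loop(loop):
    if loop.get("swp_amenable", False):
        r = try_software_pipeline(loop)
        if r.success and r.profitable:
            return emit_code(r.kernel)          # SWP'd loop skips the normal path
    run_igls(loop, "pre")                       # fallback: normal scheduling path
    run_gra_lra(loop)
    run_igls(loop, "post")
    return emit_code(loop["name"])

schedule_inner_loop({"name": "loop1", "swp_amenable": True, "counted": True})
schedule_inner_loop({"name": "loop2", "swp_amenable": False})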
