CS/EE 5810, CS/EE 6810

Chapter 2-3: Basic Program Transformations


Optimizing your code

Transforming the code into something different from what the programmer wrote, but that still does the same thing, can have a huge impact on performance

Is this a compiler subject or a computer architecture subject? Yes!

Many architectural details are driven by, or have an effect on, code optimization


Major Types of Optimization (see Chapter 2, Fig 2-19, p. 93)

High level (at or near source level)
  Procedure integration

Local (within a basic block)
  Common subexpression elimination
  Constant propagation
  Stack height reduction

Global (across a branch)
  Copy propagation
  Code motion
  Induction variable elimination

Machine-dependent
  Strength reduction
  Pipeline scheduling (more about this later…)


Strength Reduction

Substitute a simpler operation when equivalent
  Multiply => shifts and adds is a popular area

Y = X ** 2; replace with Y = X * X;

J = K * 2; replace with J = K + K;
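The multiply => shifts and adds case can be sketched in C as follows. This is only an illustration: the function names are hypothetical, and unsigned 32-bit operands are assumed so that both forms wrap around in the same way.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: one multiply. */
uint32_t times10_mul(uint32_t k) {
    return k * 10u;
}

/* Strength-reduced form: 10*k = 8*k + 2*k, i.e. two shifts and one add. */
uint32_t times10_shift_add(uint32_t k) {
    return (k << 3) + (k << 1);
}

/* Checks that both forms agree for every k in [0, limit). */
void check_times10(uint32_t limit) {
    for (uint32_t k = 0; k < limit; k++)
        assert(times10_mul(k) == times10_shift_add(k));
}
```

Whether the shift-and-add form is actually faster depends on the machine, which is why the slides list strength reduction under machine-dependent optimizations.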


Variable Renaming

Use distinct names for each unrelated use of the same variable to simplify later optimizations

  X = Y * Z;
  Q = R + X + X;
  X = A + B;     (second use of X is unrelated; replace with X1 = A + B;)


Common Subexpression Elimination

Avoid recalculating the same expression

In this code, you would hope the compiler would compute the address of a[ j ][ k ] only once for both statements…

a[ j ][ k ] = b[ j ][ k ] + x * b[ j ][ j-1 ] ;

sum = length[ j ] * a[ j ][ k ];
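A minimal sketch of what "compute the address only once" looks like if done by hand in C. The row width COLS, the int element type, and the function names are assumptions made for this illustration; the slide gives only the two statements.

```c
#include <assert.h>

#define COLS 4  /* hypothetical row width, chosen only for this sketch */

/* As on the slide: a[j][k] is named in both statements. */
int update_plain(int a[][COLS], int b[][COLS], const int *length,
                 int x, int j, int k) {
    a[j][k] = b[j][k] + x * b[j][j - 1];
    return length[j] * a[j][k];          /* the slide's "sum" */
}

/* Hand-applied CSE: the address of a[j][k] is computed once and reused. */
int update_cse(int a[][COLS], int b[][COLS], const int *length,
               int x, int j, int k) {
    int *ajk = &a[j][k];
    *ajk = b[j][k] + x * b[j][j - 1];
    return length[j] * *ajk;
}

/* Both forms must store the same element and return the same sum. */
void check_cse(void) {
    int b[2][COLS] = {{1, 2, 3, 4}, {5, 6, 7, 8}};
    int a1[2][COLS] = {{0}}, a2[2][COLS] = {{0}};
    int length[2] = {2, 5};
    int s1 = update_plain(a1, b, length, 3, 1, 2);
    int s2 = update_cse(a2, b, length, 3, 1, 2);
    assert(s1 == s2);
    assert(a1[1][2] == a2[1][2]);
}
```

A compiler performing this optimization would do the same thing on the address arithmetic directly, without introducing a named pointer.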


Loop Invariant Code Motion

Avoid operations in loops that are the same in each iteration

Original
  for (j = 0; j < max; j++) {
      a[j] = b[j] + c * d;
      e = g[k];
  }

Revised
  tmp = c * d;
  for (j = 0; j < max; j++)
      a[j] = b[j] + tmp;
  e = g[k];
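The same hoisting, written as two C functions that can be compared. This is a sketch under assumptions: int element types, non-overlapping arrays, and hypothetical function names. The guard on the final assignment is also an addition made here: the original loop never assigns e when max is 0, so an unconditional e = g[k] after the loop would differ in that one case.

```c
#include <assert.h>

/* Original loop from the slide. */
void fill_original(int *a, const int *b, int max, int c, int d,
                   const int *g, int k, int *e) {
    for (int j = 0; j < max; j++) {
        a[j] = b[j] + c * d;   /* c * d is the same on every trip */
        *e = g[k];             /* so is g[k] */
    }
}

/* Hand-hoisted form. Assumes a does not overlap g. */
void fill_hoisted(int *a, const int *b, int max, int c, int d,
                  const int *g, int k, int *e) {
    int tmp = c * d;           /* computed once, before the loop */
    for (int j = 0; j < max; j++)
        a[j] = b[j] + tmp;
    if (max > 0)               /* guard: leave e alone if the loop never ran */
        *e = g[k];
}

/* Both forms must leave a and e in the same state. */
void check_hoisting(void) {
    int b[3] = {1, 2, 3}, g[2] = {7, 8};
    int a1[3] = {0}, a2[3] = {0}, e1 = -1, e2 = -1;
    fill_original(a1, b, 3, 2, 5, g, 1, &e1);
    fill_hoisted(a2, b, 3, 2, 5, g, 1, &e2);
    for (int j = 0; j < 3; j++)
        assert(a1[j] == a2[j]);
    assert(e1 == e2);
}
```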


Copy Propagation

Propagate the original instead of the copy

In this example, y is still copied to x, but all subsequent uses of x are replaced with y

Original
  x = y;
  z = 2 * x;
  q = x + 15;

Revised
  x = y;
  z = 2 * y;
  q = y + 15;

We may find that x is never used again…


Constant Folding

If the value of a variable is really a constant that can be determined at compile time, replace it with the constant

int j = 0;

int k = 1;

m = j + k;
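For the three statements above, the folded result can be written out by hand. This is only an illustration of the end state, with hypothetical function names; a real compiler reaches it by propagating j and k and then evaluating 0 + 1 at compile time.

```c
#include <assert.h>

/* Source form from the slide: j and k are known at compile time. */
int m_source(void) {
    int j = 0;
    int k = 1;
    return j + k;
}

/* What can be emitted after propagating the constants and folding 0 + 1. */
int m_folded(void) {
    return 1;
}

/* The folded form must compute the same value as the source form. */
void check_folding(void) {
    assert(m_source() == m_folded());
}
```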


Dead Code Removal

Eliminate instructions whose results are never used

update() {
    int j, k;
    j = k = 1;
    j += 1;
    k += 2;
    printf("J is %d\n", j);
}
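In the function above, k is written twice but never read, so every statement that only feeds k can go. A sketch of the before and after, with hypothetical names; the return values are added here only so the two versions can be compared.

```c
#include <assert.h>
#include <stdio.h>

/* The slide's function, returning j so the result can be checked. */
int update_original(void) {
    int j, k;
    j = k = 1;
    j += 1;
    k += 2;                    /* result never read: dead */
    printf("J is %d\n", j);
    return j;
}

/* After dead code removal: k and its updates are gone. */
int update_trimmed(void) {
    int j = 1;
    j += 1;
    printf("J is %d\n", j);
    return j;
}

/* Removing the dead statements must not change what is printed or returned. */
void check_dead_code(void) {
    assert(update_original() == update_trimmed());
}
```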


Branch Delay Slots

Some machines (like DLX) always execute instructions in the Branch Delay Slot(s)

Challenge is for the compiler to find code to put in those slots (See Fig 3.28, P 169)

Three places to find such code
  An independent instruction from before the branch (best choice)
  From the branch target (risky: may need to copy the instruction, and it must not cause a problem if it executes when it should not have)
  From the fall-through code (risky: same problems as above…)

Compiler can hide ~70% of branch hazards on DLX running Spec92 codes.


Chapter 4: Pipeline Scheduling and ILP


Software Scheduling to Avoid Load Hazards

Try producing fast code for

  a = b + c;
  d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
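The difference between the two sequences can be counted with a small model. This is a sketch, not the DLX simulator: the struct and function names are hypothetical, and it assumes the only hazard is a one-cycle load delay, so an instruction stalls once if it reads the register loaded by the instruction immediately before it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mini-model of one instruction. */
struct insn {
    int is_load;          /* 1 for LW, 0 otherwise */
    const char *dest;     /* register written, "" if none (SW) */
    const char *src1;     /* registers read, "" if unused */
    const char *src2;
};

/* Counts load-use stalls under the one-cycle load delay assumption. */
int count_load_stalls(const struct insn *code, int n) {
    int stalls = 0;
    for (int i = 1; i < n; i++) {
        const struct insn *prev = &code[i - 1];
        if (!prev->is_load)
            continue;
        if (strcmp(prev->dest, code[i].src1) == 0 ||
            strcmp(prev->dest, code[i].src2) == 0)
            stalls++;
    }
    return stalls;
}

/* The slow sequence from the slide. */
const struct insn slow_code[8] = {
    {1, "Rb", "", ""},   {1, "Rc", "", ""},
    {0, "Ra", "Rb", "Rc"}, {0, "", "Ra", ""},
    {1, "Re", "", ""},   {1, "Rf", "", ""},
    {0, "Rd", "Re", "Rf"}, {0, "", "Rd", ""},
};

/* The fast sequence from the slide. */
const struct insn fast_code[8] = {
    {1, "Rb", "", ""},   {1, "Rc", "", ""},
    {1, "Re", "", ""},   {0, "Ra", "Rb", "Rc"},
    {1, "Rf", "", ""},   {0, "", "Ra", ""},
    {0, "Rd", "Re", "Rf"}, {0, "", "Rd", ""},
};

/* Slow code stalls after LW Rc and after LW Rf; fast code never stalls. */
void check_load_stalls(void) {
    assert(count_load_stalls(slow_code, 8) == 2);
    assert(count_load_stalls(fast_code, 8) == 0);
}
```

Under this model the slow code takes 10 cycles for 8 instructions and the fast code takes 8, with the same instructions in a different order.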


Instruction Level Parallelism (ILP)

Pipelining supports a limited sense of ILP
  E.g. overlapped instructions, hazard issues, forwarding logic, etc.

Remember:

Pipeline CPI = Ideal CPI + Structural Stalls + Data Stalls + Control Stalls

So, let’s try to be more aggressive about reducing the stalls to improve the CPI…


Software Techniques

Loop unrolling
  Bigger basic blocks
  Attempt to reduce control stalls

Basic pipeline scheduling
  Reduce RAW stalls

Lots of other hardware techniques to talk about later…


ILP Within a Basic Block

Basic block definition
  Straight line code, no branches out
  Single entry point at the top
  Real code is a bunch of basic blocks connected by branches

Notice:
  Branch frequency is approx 15% of total mix (for integer programs)
  This implies that basic block size is between 6 and 7 instructions
  Machine instructions don’t do much
  So, there’s probably little in the way of ILP available

Easiest target is the loop
  Already exploited by vector processors, but using different mechanisms


Loop Level Parallelism

Consider adding two 1000 element arrays

  for (I = 1; I <= 1000; I = I + 1)
      x[I] = x[I] + y[I];

Sure it’s trivial, but it illustrates the point
  There is no dependence between data values produced in any iteration j and those needed in j+n for any j and n
  Truly independent, hence could be 1000-way parallel
  Independence means no stalls due to data hazards
  Problem is that we have to use that pesky branch instruction

Vector processor model
  Load vectors X and Y (up to some machine-dependent max)
  Then do result-vec = xvec + yvec in a single instruction


Assumptions About Timing

Default DLX pipeline timings for this chapter

Instruction producing value   Instruction consuming value   Clock cycles to avoid stalls
FP ALU op                     FP ALU op                     3
FP ALU op                     Store double                  2
Load double                   FP ALU op                     1
Load double                   Store double                  0
Integer load                  Integer ALU op                1
Integer ALU op                Integer ALU op                0
Branch delay slot             Anything                      1


Loop Unrolling

Consider adding a scalar s to a vector (assume lowest array element is in location 0)

  for (I = 1; I <= 1000; I++) x[I] = x[I] + s;

  Loop: LD   F0, 0(r1)     ; r1 is the array pointer
        ADDD F4, F0, F2    ; add scalar in F2
        SD   0(r1), F4     ; store result
        SUBI r1, r1, #8    ; decrement pointer by 8 bytes
        BNEZ r1, Loop      ; branch if r1 != 0

How does it run without scheduling? 9 cycles per iteration
  LD, LD stall, ADDD, 2 RAW stalls, SD, SUBI, BNEZ, branch delay control stall


Loop Without and With Scheduling

Without scheduling (9 cycles)
  Loop: LD   F0, 0(r1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   0(r1), F4
        SUBI r1, r1, #8
        BNEZ r1, Loop
        stall

With scheduling (6 cycles)
  Loop: LD   F0, 0(r1)
        stall
        ADDD F4, F0, F2
        SUBI r1, r1, #8
        BNEZ r1, Loop
        SD   8(r1), F4

Note that this is non-trivial, and many compilers don’t even try
  Move SD to the branch delay slot
  But SUBI changes a register that SD needs! Since SD moved past the SUBI, its offset must be adjusted (0 becomes 8)

Down to 6 cycles per iteration, but 3 of those are still loop overhead and stall (stall, SUBI, BNEZ)


Loop Unrolling

Basic idea: take n loop bodies and concatenate them into one basic block
  Will need to adjust termination code
  Let’s say n was 4
  Then modify the R1 pointer in the example by 4x of what it was before => 32

Savings: 4 BNEZ’s + 4 SUBI’s => just one of each in new unrolled loop
  Hence 75% savings

Problem: Still have 4 load stalls per loop


Unrolled Loop Example

  Loop: LD   F0, 0(r1)
        ADDD F4, F0, F2
        SD   0(r1), F4         ; drop SUBI and BNEZ
        LD   F6, -8(r1)
        ADDD F8, F6, F2
        SD   -8(r1), F8        ; drop SUBI and BNEZ
        LD   F10, -16(r1)
        ADDD F12, F10, F2
        SD   -16(r1), F12      ; drop SUBI and BNEZ
        LD   F14, -24(r1)
        ADDD F16, F14, F2
        SD   -24(r1), F16
        SUBI r1, r1, #32
        BNEZ r1, Loop
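At the source level, the same 4x unrolling looks roughly like this. It is a sketch with hypothetical function names, using 0-based C indexing and an upward-counting loop rather than the downward pointer walk of the DLX code, and it assumes the element count is a multiple of 4 (as 1000 is).

```c
#include <assert.h>

/* Rolled loop: one add, one index update, one branch per element. */
void add_scalar_rolled(double *x, int n, double s) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Four bodies per trip; assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, int n, double s) {
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;   /* one index update and branch for all 4 */
    }
}

/* Both loops must produce identical arrays. */
void check_unrolled4(void) {
    double p[8], q[8];
    for (int i = 0; i < 8; i++)
        p[i] = q[i] = (double)i;
    add_scalar_rolled(p, 8, 2.5);
    add_scalar_unrolled4(q, 8, 2.5);
    for (int i = 0; i < 8; i++)
        assert(p[i] == q[i]);
}
```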


Unrolling With Scheduling

Don’t concatenate the unrolled segments, shuffle them instead
  4 LDs then 4 ADDDs then 4 SDs
  No more stalls since the LD -> ADDD dependent path now has 3 instructions in it…

Result is 14 cycles for 4 elements => 3.5 cycles/element
  Compare with 9 cycles with no scheduling
  or 6 cycles with scheduling but no unrolling


Loop Unrolling With Scheduling

  Loop: LD   F0, 0(r1)
        LD   F6, -8(r1)
        LD   F10, -16(r1)
        LD   F14, -24(r1)
        ADDD F4, F0, F2
        ADDD F8, F6, F2
        ADDD F12, F10, F2
        ADDD F16, F14, F2
        SD   0(r1), F4
        SD   -8(r1), F8
        SD   -16(r1), F12
        SUBI r1, r1, #32
        BNEZ r1, Loop
        SD   8(r1), F16        ; note 8-32 = -24


Things to Notice

We had 8 more unused register pairs
  We could have gone to an 8 block unroll without register conflict
  No problem since the 1000-element array would still have broken cleanly (1000/8 = 125)
  What if it had not? Suppose the division has a remainder R
  Just put R blocks (shuffled of course) in front of the loop, then start for real
  Even if you run out of registers, you can still cycle names and remove stalls

Most compilers unroll early to expose code for later optimizations
  This one had a tricky one => the SD/SUBI swap
  Key was the independent nature of each loop body
  What if they’re not independent?
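The "put R blocks in front of the loop" idea can be sketched in C for a 4x unroll. The function name is hypothetical, and the sketch assumes n >= 0 and 0-based indexing; it only shows the control structure, not the instruction shuffling.

```c
#include <assert.h>

/* Adds s to x[0..n-1] with a 4x unrolled main loop and a leading remainder. */
void add_scalar_any_length(double *x, int n, double s) {
    int r = n % 4;                 /* R leftover bodies, 0..3 */
    int i = 0;
    for (; i < r; i++)             /* run the R leftovers first */
        x[i] = x[i] + s;
    for (; i < n; i += 4) {        /* remaining count is a multiple of 4 */
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}

/* Checks every length from 0 to 10, including non-multiples of 4. */
void check_any_length(void) {
    for (int n = 0; n <= 10; n++) {
        double x[10];
        for (int i = 0; i < 10; i++)
            x[i] = (double)i;
        add_scalar_any_length(x, n, 1.0);
        for (int i = 0; i < 10; i++)
            assert(x[i] == (i < n ? i + 1.0 : (double)i));
    }
}
```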


Data Dependency Analysis

Three types: data, name, and control

i is data dependent on j if:
  i uses a result produced by j
  Or, i uses a result produced by k, and k depends on j

Dependence indicates a possible RAW hazard
  Does it induce a stall? Depends on pipeline structure and forwarding capability

Compiler dataflow analysis
  Creates a graph that makes these dependencies explicit as directed paths


Data Dependency

Loop: LD F0, 0(r1)

ADDD F4, F0, F2

SD 0(r1), F4

SUBI R1, R1, #8

BNEZ R1, Loop


Name Dependence

Occurs when a second instruction uses the same register name without a data dependence
  E.g. an unrolled loop without changing register names

Let i precede j in program order
  An antidependence exists between i and j when j writes a register that i reads
    Essentially the same as a WAR hazard
    Hence ordering must be preserved to avoid the hazard
  An output dependence exists between i and j if they both write the same register
    Essentially a WAW hazard
    So we have to avoid that too

Otherwise, no real data dependence, just name
  So registers can be renamed statically by the compiler or dynamically by the hardware


Control Dependence

Since branches are conditional
  Some instructions will be executed, and others will not
  Must maintain order due to branches

Two obvious constraints to maintain control dependencies
  Instructions controlled by the branch can’t be moved before the branch (or they would become unconditional)
  Instructions not controlled by the branch can’t be moved after the branch (or they would become conditional)

Simple pipelines preserve this so it’s not a big deal.


Loop-Carried Dependence

Consider the following code:

  for (I = 1; I <= 1000; I++) {
      A[I+1] = A[I] + C[I];      /* S1 */
      B[I+1] = B[I] + A[I+1];    /* S2 */
  }

S1 uses an S1 value produced in a previous iteration
S2 uses an S2 value produced in a previous iteration
S2 uses an S1 value produced in the same iteration
So S1 has a loop-carried dependence on itself, and similarly S2 has a loop-carried dependence on itself
If non-loop-carried dependencies were the only ones, we could execute loop bodies in parallel


Another Loop Carried Dependence

  for (I = 1; I <= 100; I++) {
      A[I] = A[I] + B[I];        /* S1 */
      B[I+1] = C[I] + D[I];      /* S2 */
  }

S1 uses the value S2 produced in the previous iteration
However, the dependence is not circular since neither statement depends on itself
And there is no "S1 depends on S2 depends on S1" circularity either
So, no cycle in the dependencies: the loop can be parallelized and unrolled (provided statements are kept in order)

Equivalent loop in which the dependence is no longer loop-carried:

  A[1] = A[1] + B[1];
  for (I = 1; I <= 99; I++) {
      B[I+1] = C[I] + D[I];
      A[I+1] = A[I+1] + B[I+1];
  }
  B[101] = C[100] + D[100];
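One way to convince yourself that the two loops compute the same thing is to run both on the same data and compare. The sketch below does that; int elements, the fill values, and the function names are assumptions made for the check, and arrays are sized N + 2 so the 1-based indices 1..N+1 are valid.

```c
#include <assert.h>
#include <string.h>

#define N 100   /* trip count from the slide */

/* Original loop: S1 reads the B value written by S2 one iteration earlier. */
void loop_original(int *A, int *B, const int *C, const int *D) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: the B value is now produced and used in the same trip. */
void loop_transformed(int *A, int *B, const int *C, const int *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}

/* Runs both loops on identical inputs and compares every element. */
void check_loops_agree(void) {
    int A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2], C[N + 2], D[N + 2];
    for (int i = 0; i < N + 2; i++) {
        A1[i] = A2[i] = i;
        B1[i] = B2[i] = 3 * i + 1;
        C[i] = 7 * i;
        D[i] = i * i;
    }
    loop_original(A1, B1, C, D);
    loop_transformed(A2, B2, C, D);
    assert(memcmp(A1, A2, sizeof A1) == 0);
    assert(memcmp(B1, B2, sizeof B1) == 0);
}
```

A single test input is evidence rather than proof, but it catches off-by-one mistakes in the peeled first and last statements.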


Our Infrastructure for Lab 1

In the /home/cs/handin/cs5810/bin directory on CADE
  lcc - DLX C compiler - use with -S switch to get assembly code in a .s file
  dlxasm - Assembler that converts .s files into .dlx object files that can run on our simulator
  bin2a - A binary to ASCII converter that lets you look at object files if you like
  dlxsim - A simulator for the DLX processor
    Type h or ? at the prompt for a brief listing of commands
    Only gives executed instruction counts at the moment
    You’ll extend it later…


Data Infrastructure

In the /home/cs/handin/cs5810/ directory
  New directory for each lab
    I.e. /home/cs/handin/cs5810/lab1
  Also a src directory with benchmarks (small toy examples) in C
