
International Journal of Parallel Programming, Vol. 28, No. 5, 2000

Data Dependence Analysis of Assembly Code

Wolfram Amme,1 Peter Braun,1 François Thomasset,2 and Eberhard Zehendner3,4

Received July 1999; revised March 2000

Determination of data dependences is a task typically performed with high-level language source code in today's optimizing and parallelizing compilers. Very little work has been done in the field of data dependence analysis on assembly language code, but this area will be of growing importance, e.g., for increasing instruction-level parallelism. A central element of a data dependence analysis in this case is a method for memory reference disambiguation which decides whether two memory operations may access (or definitely access) the same memory location. In this paper we describe a new approach for the determination of data dependences in assembly code. Our method is based on a sophisticated algorithm for symbolic value propagation, and it can derive value-based dependences between memory operations instead of just address-based dependences. We have integrated our method into the Salto system for assembly language optimization. Experimental results show that our approach greatly improves the precision of the dependence analysis in many cases.

KEY WORDS: Data dependence analysis; value-based dependences; memory reference disambiguation; assembly code; monotone data flow frameworks.

1. INTRODUCTION

The determination of data dependences is nowadays most often done by parallelizing and optimizing compiler systems on the level of source code,


1 Computer Science Department, Friedrich Schiller University, D-07740 Jena, Germany.
2 INRIA, Rocquencourt, 78153 Le Chesnay Cedex, France. E-mail: Francois.Thomasset@inria.fr.
3 Computer Science Department, Faculty of Mathematics and Computer Science, Friedrich Schiller University, D-07740 Jena (P.O. Box), Germany.
4 To whom correspondence should be addressed at e-mail: zehendner@acm.org.

e.g., C or FORTRAN 90, or some intermediate code, e.g., RTL.(1) Data dependence analysis on the level of assembly code aims at increasing instruction-level parallelism. Using various scheduling techniques like list scheduling,(2) trace scheduling,(3) or percolation scheduling,(4) a new sequence of instructions is constructed with regard to data and control dependences, and properties of the target processor. Most of today's instruction schedulers only determine data dependences between register accesses, and consider memory to be one cell, so that every pair of memory accesses must be assumed as being data dependent. Analyzing memory accesses becomes particularly important while doing global instruction scheduling.(5)

Performing optimizations at the level of assembly code has many benefits. The developer of machine-dependent optimization techniques needs a tool that is easily programmable and exposes (almost) all processor properties. Implementing new techniques in an existing compiler (even an easily retargetable compiler, such as the GNU C Compiler gcc(6)) fails because new processor features may not be expressible within the machine description. In addition, working on assembly code also gives the opportunity to optimize at link-time. Wall(7) gives an overview of systems which perform so-called late code modification; see Srivastava and Wall(8) for a full description of such techniques. Another aspect concerns optimization of delivered code. Fisher(9) suggests translating an executable program during loading, e.g., replacing multimedia extensions by ``normal'' instructions or vice versa. When processors of the same family differ in the number of functional units or number of registers, rescheduling(10) can be used to optimize for the new processor. A program can be made executable on a completely different processor by the use of binary translation.(11) In all these areas, a better analysis of assembly code and more precise data dependence information would offer significant benefits.

In this paper, we describe an intraprocedural value-based data dependence analysis.(12, 13) When analyzing data dependences in assembly code we must distinguish between accesses to registers and those to memory. In both cases we derive data dependences from reaching definitions and reaching uses (cf. Section 3), information that we obtain by a monotone data flow analysis. Register analysis does not involve any complication: the set of used and defined registers in one instruction can be established easily because registers do not have aliases. Therefore, determination of data dependences between register accesses is not in the scope of this paper.

For memory references we have to solve the aliasing problem, cf. Landi and Ryder(14) and Wall(15): decide whether two memory references access the same location. We have to prove that two references always point to the same location (must-alias) or must show that they never refer to the same location. If we cannot prove the latter, we would like to have a conservative approximation of all alias pairs (may-alias), i.e., memory references that might refer to the same location. To derive all possible addresses that might be accessed by one memory instruction, we developed a symbolic value propagation algorithm. To compare symbolic values for memory addresses we use a modification of the GCD test.(16)

We implemented our technique in the context of the Salto tool.(17) Salto is a framework to develop optimization and transformation techniques for various processors. The user describes the target processor using a mixture of RTL and C language. A program written in assembly code can then be analyzed and modified using an interface in C++. Salto already contains some kind of conflict analysis,(18) but it only determines address-based dependences between register accesses and assumes memory to be one cell. The technique we present in this paper goes far beyond that. Experimental results indicate that our analysis can be more precise in the determination of data dependences than other previous methods.

The rest of this paper is structured as follows: In Section 2 we introduce our programming model. Section 3 gives a brief introduction to the field of data dependence analysis and alias analysis in assembly code. Section 4 describes the concept of monotone data flow systems, which is the theoretical basis of our approach. We present our method in detail in Section 5. Section 6 shows experimental results, in Section 7 we discuss related work, and in Section 8 we conclude with an outlook on further developments.

2. PROGRAMMING MODEL AND ASSUMPTIONS

In the following we assume a RISC instruction set that is strongly influenced by the SPARC architecture. Note however that our analysis is not limited to the SPARC. Memory is only accessed through load (ld) and store (st) instructions. Memory references can have the following formats: (a) mem = %rx + %ry or (b) mem = %rx + offset. Use of a scaling factor is not provided in this model, but an addition would not be difficult. Our method supports memory instructions that read or write blocks of any reasonable size. [Note: We haven't looked at general memory-to-memory instructions yet, but that will be a topic for a further study. Since memory blocks loaded to registers will always be small compared to the whole address space, our method is sufficient at least in this respect.] For global memory access, the address (which is a label) first has to be moved to a register; then the corresponding memory block can be read or written using a memory instruction. Initialization of registers or copying the contents of one register to another can be done by the mov instruction. Each logical or arithmetic instruction has the following format: op src1, src2, dest. The operation op is executed on operands src1 and src2; the result is written to register dest. An operand can be a register or an integer constant. Our method requires that any register directly or indirectly involved in address calculation must not be changed by exception handling, and that a value written to such a register can be retrieved afterwards.

2.1. Arithmetic on Addresses

We have to take care of the fact that computation of addresses may wrap around beyond $2^L$, where $L$ is the number of bits used to represent addresses. Arithmetic operations in the processor are based on modulo arithmetic; we assume that all integer registers have the same width $L$ and calculation is performed modulo $2^L$ with wrap-around where applicable. Therefore we prescribe the calculation of the coefficients from Section 5 to be performed modulo $2^L$, either in signed or in unsigned binary arithmetic.
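To make the wrap-around concrete, the following minimal sketch (ours, not the paper's; it assumes L = 32) shows address arithmetic modulo $2^L$ in Python; since Python integers are unbounded, the reduction must be made explicit:

    # Address arithmetic modulo 2**L (L = 32 assumed for illustration).
    L = 32
    MOD = 1 << L

    def addr_add(x, y):
        # Addition with wrap-around, as performed by the processor.
        return (x + y) % MOD

    # 0xFFFFFFFC + 8 wraps around to 4 instead of leaving the address space.
    assert addr_add(0xFFFFFFFC, 8) == 4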

2.2. Control Flow

Control transfer in the program might be by either unconditional (b) or conditional (bcc) branch instructions, or by call instructions. Control flow is modeled using an intraprocedural control flow graph whose nodes stand for individual instructions, not basic blocks. Our method is able to handle any control flow graph, including irreducible graphs, that starts at a single entry node. However, the validity of the results relies on the assumption that any actual control path in the program is compatible with a path in the control flow graph; i.e., when control leaves the intraprocedural control flow graph at a certain node n by a call instruction or by an exception, execution resumes at a successor node of n with respect to the intraprocedural control flow graph.

2.3. Memory Classes

Runtime memory is divided into three classes: static or global memory, stack, and heap memory.(19) When an address unequivocally references one of these classes, some simple memory reference disambiguation is feasible, cf. Section 3. However, to apply this analysis technique, we would have to rely heavily on compiler conventions. In the current state of our implementation we make no use of memory classes; in the future our tool could be parameterized by any compiler conventions as well as by the target machine.

3. DATA DEPENDENCES IN ASSEMBLY CODE

In assembly programs we have two classes of locations in which a program can store data: registers and memory. For a definition of data dependences we can merge these classes and treat them both as memory locations:

A statement $S_2$ has a value-based data dependence(12, 13) on a statement $S_1$ if $S_2$ can be reached after the execution of $S_1$, and the following conditions hold:

(i) Both statements access a common memory location $l$.

(ii) At least one of them writes to $l$.

(iii) Between the execution of $S_1$ and $S_2$, there is no statement $S_3$ that also writes to $l$.

A data dependence is called a flow dependence if $S_1$ writes to $l$ whereas $S_2$ reads from $l$. It is called an anti-dependence if $S_1$ reads from $l$ and $S_2$ writes to $l$. It is an output dependence if both statements write to $l$. $S_1$ and $S_2$ are in conflict if conditions (i) and (ii) hold, whereas (iii) may or may not be fulfilled. Another name for a conflict is address-based data dependence. [Note: Feautrier(12) and Pugh and Wonnacott(13) call this a memory-based data dependence.] Conflict analysis is a frequently used approximation of data dependences.

The determination of data dependences can be achieved by different means. The most commonly used for flow dependences or output dependences is the calculation of reaching definitions. Finding reaching definitions(19) means to detect all statements where the value of a specific memory location could have been written last. Once the reaching definitions have been determined, we are able to infer def-use and def-def associations; a def-use pair of statements indicates a flow dependence between them, and a def-def pair shows an output dependence.

For finding anti-dependences, we take a dual approach. We determine what we call reaching uses for all statements. A statement s is a reaching use for another statement t with respect to a specific memory location if the contents of this location is read by s and may still be there when the control flow reaches t. From reaching uses we can infer use-def associations, each indicating an anti-dependence.
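The step from reaching sets to dependence pairs can be sketched as follows. This is an illustrative fragment only, not the paper's implementation; the inputs rdefs, ruses, reads, and writes (all with respect to one fixed memory location) are hypothetical:

    # Sketch: deriving dependences for one fixed location from reaching sets.
    # rdefs[s] / ruses[s]: statements whose definition / use may reach s;
    # reads[s] / writes[s]: whether s reads / writes the location.
    def dependences(stmts, rdefs, ruses, reads, writes):
        flow, anti, output = set(), set(), set()
        for s in stmts:
            for t in rdefs[s]:
                if reads[s]:        # def-use association: flow dependence
                    flow.add((t, s))
                if writes[s]:       # def-def association: output dependence
                    output.add((t, s))
            for t in ruses[s]:
                if writes[s]:       # use-def association: anti-dependence
                    anti.add((t, s))
        return flow, anti, output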

Data dependence analysis of registers causes no problems. The set of used and defined registers can be established for each instruction by its semantics. For memory references we have to solve the aliasing problem, i.e., we have to determine whether two memory references access the same location. In the following we briefly review techniques for alias analysis of memory references.


Fig. 1. Sample SPARC code for different techniques of alias detection: (a) and (b) can be solved by instruction inspection, whereas (c) needs a sophisticated analysis.

Doing no alias analysis leads to the assumption that a load instruction is always dependent on a store instruction, and a store instruction is always dependent on any memory instruction. A common technique in compile-time instruction schedulers is alias analysis by instruction inspection, where the scheduler looks at two instructions to see if it is obvious that different memory addresses are referenced. With this technique, independence of the memory references in Fig. 1a can be proved because the same base register but different offsets are used; by using some compiler conventions, we might show independence of the memory references in Fig. 1b, since different memory classes are referenced. Figure 1c shows an example where these techniques fail. By looking only at the second and the third statement, we miss the independence of these statements, since we ignore the relation between the registers %o1 and %fp that could be inferred from the first statement. This example makes it clear that a two-fold improvement is needed. First, we need to save information about address arithmetic, and second, we need some kind of copy propagation. Provided that we have such an algorithm, it would be easy to show that in the second statement register %o1 has the value %fp-20 and thus there is no overlap between the 4 bytes wide memory blocks starting at %o1-4 and %fp-20.

4. MONOTONE DATA FLOW ANALYSIS

The most important passes of our analysis are performed by data flow analysis.(20) Therefore, several parts of the analysis have been described as a monotone data flow framework. Formally, a monotone data flow framework is a triple $MDFS = (DF, \sqcap_{DF}, F)$. $DF$ is called the data flow information set, which is a formal description of the information that will be propagated through the control flow graph. The meet operator $\sqcap_{DF}$ of a data flow framework describes the effect of joining paths in the flow graph, i.e., how the incoming data flow information is computed for a node with more than one predecessor. $F \subseteq \{f : DF \to DF\}$ is the set of our semantic functions. To each node of the control flow graph we assign one of these semantic functions, which specifies how the outgoing data flow information is derived from the incoming data flow information.


If the semantic functions are monotone and $(DF, \sqcap_{DF})$ forms a bounded semi-lattice with a top element and a bottom element,(16) we can use a general iterative algorithm(20) that determines for each statement of the control flow graph an element $d \in DF$ that is a safe approximation of the data flow information reaching the statement. The solution of a monotone data flow framework thus can be expressed as a function $mdfs: STMT \to DF$. The initial data flow information at each statement is the top element of the corresponding semi-lattice.
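The general iterative algorithm itself is short; the following Python sketch of a generic forward solver is ours and only assumes what the framework provides: a meet operator, the top element, one semantic (transfer) function per node, and the predecessor sets:

    # Generic iterative solver for a monotone data flow framework (forward).
    def solve_mdfs(nodes, preds, meet, top, transfer):
        info = {n: top for n in nodes}     # start each statement at top
        changed = True
        while changed:                     # iterate until a fixpoint is reached
            changed = False
            for n in nodes:                # best visited in reverse postorder
                incoming = top
                for p in preds[n]:
                    incoming = meet(incoming, transfer[p](info[p]))
                if incoming != info[n]:
                    info[n] = incoming
                    changed = True
        return info                        # info[s] approximates mdfs(s)

Termination follows from the boundedness of the semi-lattice and the monotonicity of the semantic functions.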

We define a relation $\sqsubseteq$ on the semi-lattice $(DF, \sqcap_{DF})$ of a monotone data flow framework by

$$\forall a, b \in DF : a \sqsubseteq b \iff a \sqcap_{DF} b = a$$

Monotonicity of the semantic functions can be expressed using the relation $\sqsubseteq$. A semantic function $f \in F$ is monotone iff

$$\forall a, b \in DF : a \sqsubseteq b \implies f(a) \sqsubseteq f(b)$$

Note that for a statement $s$ and for all $e \in DF$ that may reach statement $s$, the solution obtained when applying the general algorithm satisfies $mdfs(s) \sqsubseteq e$.

4.1. Description of Semantic Functions

A unifying description of semantic functions can simplify the proving of monotonicity. If there is a suitable definition of a difference operator $\setminus$, the general form of a semantic function is given as5

$$f_s(D) = (D \setminus K_s(D)) \sqcap_{DF} G_s(D)$$

where $D$ stands for the data flow information that is reaching statement $s$, an application of $K_s(D)$ returns the part of the incoming data flow information that will be destroyed by the execution of statement $s$, and $G_s(D)$ describes the data flow information that is introduced by the execution of $s$.

5 $K_s(D)$ is a ``kill function,'' and $G_s(D)$ is a ``generate function.''
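For set-valued data flow information, with set difference as the difference operator and set union as the meet, this general form reads as the following sketch (an assumption about the concrete lattice, not part of the paper's definitions):

    # Sketch of f_s(D) = (D \ K_s(D)) meet G_s(D) for set-valued information.
    def semantic_fn(s, D, kill, gen):
        return (D - kill(s, D)) | gen(s, D)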

5. DETERMINATION OF DATA DEPENDENCES

Figure 2 shows an overview of our data dependence analysis for machine code. Our technique determines data dependences for memory accesses (and for register accesses) by the calculation of reaching definitions and uses (RDU). For registers the determination of reaching definitions and uses can be performed by a well-known standard algorithm.(19)

Fig. 2. Overview of the determination of data dependences.

To use this algorithm for data dependence analysis of memory accesses we have to derive may-alias information; in principle, we have to check whether two storage accesses could refer to the same storage object. To improve the precision of the data dependence analysis, must-alias information is needed, i.e., we have to find out whether two storage accesses always refer to the same storage object. [Note: We calculate neither must-dependences nor a full may-alias or must-alias relation in this paper, although our data flow information would permit us to do so.]

5.1. Informal Description of the Method

The main task of our analysis concerns the determination of memory addresses. Therefore, we have to determine all possible values of registers in memory expressions. In some rare cases solving this problem is trivial, e.g., when a constant is moved to a register, or when adding two registers whose values are known. In contrast, there are situations in which it is in general impossible to derive the value of a register, e.g., when the register is defined by a load instruction. In Section 8, we describe an approach for alleviating this restriction. As we want to perform a precise and safe analysis, we have to look for a way to work with these as yet unknown values.

5.1.1. Symbolic Value Sets

In our approach we use the concept of symbolic values. A statement $j$ is called an initialization point if $j$ defines a register with a value we are not able to relate to a constant or to the values residing in other registers. We distinguish artificial initialization points from natural initialization points in our programming model. Natural initialization points are load instructions, call nodes, entry nodes of a procedure, or instructions operating on their operands in a nonlinear way. An artificial initialization point is a kind of dummy assignment that is inserted into the original program by our method as explained later. If an initialization point $j$ defines the contents of register $r_i$ we refer to this unknown value through the symbol $R_{i,j}$.

In this method we calculate possible symbolic value sets for each register and each program statement. A symbolic value set is a set of linear polynomials, where each polynomial stands for a possible content of the associated register. Variables of such polynomials are represented by the symbols $R_{i,j}$, i.e., the definition values of initialization points. Figure 3 shows the calculation of symbolic value sets for a simple assembly code with control flow. For each statement we determined symbolic value sets that describe the register contents immediately before the execution of the statement.

Fig. 3. Results of a symbolic value set propagation for a simple procedure code with control flow. Registers $r_i$ that are not mentioned have the value $\{R_{i,0}\}$.

There could be a register defined inside a loop, or, more generally, inside a cycle of the control flow graph, whose contents would change in each iteration in a predictable way, e.g., by an increment instruction. [A new iteration of a cycle is started when passing a reference point in the cycle, called the cycle head in the sequel.] Then our data flow propagation algorithm might not terminate. However, by limiting the cardinality of each symbolic value set we can enforce termination. One method to achieve this goal is k-bounding, where the data flow information becomes the bottom element $\bot$ of the semi-lattice whenever no symbolic value set with at most $k$ elements would constitute a safe approximation. In fact, inflation of unbounded symbolic value sets can also happen in other situations, for instance through the control flow induced by deeply nested conditionals. For this reason, we are motivated to employ k-bounding anyway, although we tend to use not too small values for $k$; usually the precision of the analysis rises when we take a larger $k$. Concerning cycles, this decision has two consequences: First, our analysis, although guaranteed to terminate, could still need a large number of iterations to stabilize. Second, in most cases we would not retain useful information for registers changing inside a cycle since nothing can be inferred from the value $\bot$.

Compatible with k-bounding (both techniques can be used together or independently) we introduce the concept of artificial initialization points to achieve fast termination as well as high precision with respect to cycles. An artificial initialization point is a kind of dummy assignment to a register, inserted between an entry point to a cycle in the control flow graph and the first instruction after the entry that corresponds to a machine instruction. This dummy assignment causes the symbolic value set of the register in question to be set to a symbol, as with a natural initialization point. Thus changes to the data flow information for this register, caused by running once through the whole cycle, are not propagated to the next iteration, forcing early termination of the analysis. [Note as a drawback that without the use of induction variables, some of the available information may be obscured. See also our discussion in Sections 6-8.] Artificial initialization points therefore only make sense for registers that change inside the cycle. As another advantage, to a certain amount we can compare symbolic value sets for memory addresses derived from registers that change inside a cycle.

Figure 4 shows the results of a symbolic value set propagation using artificial initialization points for a simple loop code. [See Fig. 6 for the complete source and assembly code.] Changing registers of the loop are r0, r1, r2, and r3. For each changing register an artificial initialization point could be inserted into the program. However, for reasons explained in Section 5.2, a single initialization point is sufficient here. As a consequence, the data flow algorithm terminates already during the second iteration. Moreover, without the artificial initialization point for register r0, the value of register r1 would have been set to $\bot$ eventually, disabling any dependence analysis. We continue to discuss this example in Section 5.6.

5.1.2. Symbolic Address Sets

Calculation of symbolic value sets for registers is necessary to determine the symbolic address set of a memory statement. In a subsequent step of our analysis we use symbolic address information and information of control flow for the determination of must-alias information as well as reaching definitions and uses. From these and from may-alias information, we derive data dependence information of memory accesses in the last step.

In order to obtain all this information we need a mechanism which checks whether the index expressions of two storage accesses $X$ and $Y$ could (or definitely do) represent the same value. To solve this problem, we replace the appearances of registers in $X$ and $Y$ with elements of their corresponding symbolic value sets, and check for all possible combinations whether the equation $X - Y = 0$ has a solution.

Fig. 4. Symbolic value set propagation for a loop. Register contents are only mentioned after they changed.

For an example, we refer to Fig. 3. Obviously, instruction 1 is a reaching use of memory in instruction 11. The derived memory addresses are $R_{31,0}+68$ for instruction 1, and $R_{31,0}+72$ or $R_{31,0}+76$ for instruction 11. Given the assumption that both instructions access memory words of 4 bytes, we can prove that disjoint memory locations will be concerned, thus showing independence of these memory instructions.

When accessing nonaligned memory blocks, or blocks of different length, the independence test becomes more demanding. We then have to check for possible intersections of memory blocks, meaning that we show the infeasibility of a constraint problem.

5.2. Handling Cycles in the Control Flow Graph

The chosen front-end, Salto in our case, provides us with a classical control flow graph whose nodes are basic blocks. During a depth-first traversal of the given control flow graph we expand each basic block to its corresponding sequence of instructions, thus simplifying our implementation. Each elementary instruction constitutes a separate node in the resulting expanded control flow graph. These nodes are placed into an array in reverse postorder. The index of a node in the array is called its depth-first number(19) and serves to identify the node. For each node s we store some information concerning the node, in particular the set of control flow predecessors pred(s). The new structure supports efficient depth-first traversals of the control flow graph or some subgraph, needed during data flow forward propagation and some preparative phases.

Next we identify simple cycles in the control flow graph. For each cycle we take the node with the least depth-first number as a reference point, called the head of the cycle. Several cycles can share a single head. It is not necessary to identify individual cycles. Instead, we calculate the set of all nodes belonging to any of the simple cycles that share a head; we call this set of nodes a cycle pack. [In a reducible control flow graph, a cycle pack is simply a loop body.]

We can identify each cycle head as the destination of a control flow back edge with respect to the reverse postorder.(21) An arc $(f, h)$ in the control flow graph is called a back edge iff $h \le f$ in the reverse postorder. Thus $h$ is a cycle head iff $h \le f$ for some arc $(f, h)$. To find out whether $h$ is a cycle head, we check for $h \le f$ over all predecessor nodes $f$ of $h$ in the control flow graph. Remember that we stored a description of these predecessors with $h$ in our representation.

Fig. 5. Sample loop codes motivating initialization points.

The source of a back edge targeting $h$ we call a back node of $h$. It is convenient to also store a description of the back nodes along with the node information of $h$ since we have to scan over these nodes in succeeding phases of our analysis. Formally, the set of back nodes is given by $back(h) = \{f \in pred(h) : h \le f\}$.
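As a sketch, the back-edge test can be written as follows (assuming rpo maps every node to its reverse-postorder index, smaller meaning earlier):

    # Identify cycle heads and their back nodes from a reverse postorder.
    # An arc (f, h) is a back edge iff h <= f in the reverse postorder.
    def back_nodes(nodes, preds, rpo):
        back = {}
        for h in nodes:
            bs = {f for f in preds[h] if rpo[h] <= rpo[f]}
            if bs:              # h is a cycle head iff it has a back node
                back[h] = bs
        return back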

5.2.1. Placing Artificial Initialization Points

Placement of artificial initialization points at cycle heads only is sufficient to force termination. To determine for which registers we need initialization points, we analyze the definitions and uses of the registers inside a cycle pack to some depth. We find the nodes belonging to a cycle pack by backward depth-first search, starting from the sources of the back edges corresponding to the cycle head. As a side-effect, we detect all entries to the cycle pack.

An artificial initialization point is required at the head of a cycle pack for a register that is defined within the cycle pack and that satisfies either of the following conditions:

1. The register is also used within the cycle pack, and a use at one iteration sees a definition from a previous iteration. This may, for instance, happen when the register is used before being defined (registers r1 and r2 in Fig. 5a) or when use and definition appear in different branches (register r1 in Fig. 5b).6

2. The register is used after exit from the cycle pack, and such a use sees a definition from an iteration not being the last one before leaving the cycle pack (register r2 in Fig. 5c).

6 We prefer an intermediate notation instead of assembly code here because it is easier to read.

To free the analysis from searching through the whole control flow graph following a cycle pack,7 we relax the conditions concerning the second case, and prescribe an artificial initialization point for a register already when some code following a cycle pack could observe a definition from an iteration not being the last one before leaving the cycle pack. We call a node $e$ contained in a cycle pack an exit node of the cycle pack when the control flow may leave the cycle pack there, i.e., when any node $n$ outside the cycle pack has $e$ as a predecessor. By $exit(h)$ we denote the set of all exit nodes corresponding to cycle head $h$.

We now introduce basic definitions that we need for the formal description of our method. Let REGS be the set of all registers, and STMT the set of all statements. Furthermore, let defreg and usereg be defined as follows:

$$defreg: STMT \to \mathcal{P}(REGS), \quad \text{statement } s \text{ may write to all registers in } defreg(s),$$
$$usereg: STMT \to \mathcal{P}(REGS), \quad \text{statement } s \text{ may read from any register in } usereg(s).$$

As mentioned before, defreg and usereg can be derived directly from the semantics of the instruction set of the supposed target processor.

The set of all symbols is denoted as SYM. Note that all symbols that we introduced for initialization points are elements of SYM. The set of all initialization points of a program is given by the image of the function ip,

$$ip: SYM \to STMT, \quad \text{symbol } x \text{ is defined by statement } ip(x)$$

To determine the necessity of initialization points at the head $h$ of a cycle pack, we investigate definitions and uses of registers on all control flow paths that completely belong to the cycle pack (excluding back edges with respect to $h$), starting at $h$ and ending at a node $n$. This includes the effects induced by $n$ itself. We calculate the following information:

$$\phi(n) = \text{set of registers } r \text{ defined on every path to } n,$$
$$\mu(n) = \text{set of registers } r \text{ used on some path to } n \text{ but not defined before this use,}$$
$$\psi(n) = \text{set of registers } r \text{ defined on at least one path to } n$$

7 This would imply a liveness analysis at the exit nodes.

We compute this information by a MDFS, where the domain of the semi-lattice is the Cartesian product

$$\mathcal{P}(REGS) \times \mathcal{P}(REGS) \times \mathcal{P}(REGS)$$

The meet operator is8

$$(\phi, \mu, \psi) \sqcap (\phi', \mu', \psi') = (\phi \cap \phi', \mu \cup \mu', \psi \cup \psi')$$

inducing $(REGS, \emptyset, \emptyset)$ as top element and $(\emptyset, REGS, REGS)$ as bottom element. The semantic function for the cycle head $h$ is

$$f_h = (defreg(h), usereg(h), defreg(h))$$

that for any other node $n$ in the cycle pack is

$$f_n(\phi, \mu, \psi) = (\phi \cup defreg(n),\ \mu \cup (usereg(n) \setminus \phi),\ \psi \cup defreg(n))$$

8 Edges entering a cycle pack must not be considered.

Initialization points are only needed for registers from the set

$$\bigcup_{f \in back(h)} \psi(f) \ \cap\ \Big( \bigcup_{f \in back(h)} \mu(f) \ \cup \bigcup_{e \in exit(h)} \psi(e) \Big)$$

Although a heuristics at the moment only, we take for granted that on average our decisions will not harm precision. In some cases it might be more advantageous not to insert an artificial initialization point for a certain register, or to place it at another node. We doubt, however, that improved strategies could be implemented without increasing analysis time by a large factor.
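A sketch of this framework in Python follows; meet3 and transfer3 plug into a generic solver such as the one sketched in Section 4, and sol is assumed to map each node to its final (phi, mu, psi) triple:

    # Meet and transfer for the (phi, mu, psi) framework, and the final
    # set of registers needing an artificial initialization point at h.
    def meet3(a, b):
        (p1, m1, s1), (p2, m2, s2) = a, b
        return (p1 & p2, m1 | m2, s1 | s2)

    def transfer3(n, triple, defreg, usereg):
        phi, mu, psi = triple
        return (phi | defreg(n),
                mu | (usereg(n) - phi),   # used before defined on this path
                psi | defreg(n))

    def needs_init_point(back, exits, sol):
        psi_back = set().union(*(sol[f][2] for f in back))
        mu_back = set().union(*(sol[f][1] for f in back))
        psi_exit = set().union(*(sol[e][2] for e in exits))
        return psi_back & (mu_back | psi_exit)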

5.3. Symbolic Value Propagation

To describe functional relations between register contents and certain initial values represented by symbols from SYM, we use polynomials that are linear in the symbols. Each such polynomial can be described formally as a mapping from the symbols to the integers. For representing a constant additive term we introduce $\epsilon$ as an artificial symbol: $SYM_\epsilon = SYM \cup \{\epsilon\}$.

Then, the space $SV$ (symbolic values) of all formal linear polynomials in the symbols with coefficients from the integers can be described as follows: $SV = [SYM_\epsilon \to \mathbb{Z}]$. A more familiar representation of such a polynomial is as a formal sum

$$v = a + \sum_h a_h \cdot x_h \quad \text{with } a, a_h \in \mathbb{Z},\ x_h \in SYM$$

5.3.1. Free Symbols

We describe the set of symbols with nonzero coefficients in a polynomial (the so-called free symbols) by a function

$$free: SV \to \mathcal{P}(SYM), \quad free(v) = \{x \in SYM : v(x) \neq 0\}$$

We write $v = a$ as a shortcut when $free(v) = \emptyset$ and $v(\epsilon) = a$.

5.3.2. Operations

Since we want to stay within the space of linear polynomials, only some of the known operations on polynomials are feasible:

$$\forall v, w \in SV,\ x \in SYM_\epsilon : (v \oplus w)(x) = v(x) + w(x),$$
$$\forall v, w \in SV,\ x \in SYM_\epsilon : (v \ominus w)(x) = v(x) - w(x),$$
$$\forall v \in SV,\ x \in SYM_\epsilon : (\ominus v)(x) = -v(x),$$
$$\forall v \in SV,\ x \in SYM_\epsilon,\ c \in \mathbb{Z} : (c \otimes v)(x) = (v \otimes c)(x) = c \cdot v(x)$$

We use $SV_\bot = SV \cup \{\bot\}$ to represent unknown or approximated values. All coefficients ought to be residues because the machine calculates modulo $2^L$. However, since addition, subtraction, and multiplication provide ring homomorphisms between $\mathbb{Z}$ and $\mathbb{Z}_{2^L}$, we can pretend to calculate in $\mathbb{Z}$ and perform the reduction implicitly during the dependence test. In our implementation this reduction is implicit in an even more natural way because we store the coefficients in registers of width $L$ bits.

5.3.3. A Bounded Semi-Lattice

Since we abstract from the predicates of conditionals, we must work with sets of values from $SV$. To do an effective analysis each such set should contain only a small number of elements. An appropriate ``bounding concept'' is introduced via the space $SVS = \{S \subseteq SV : |S| \le l\}$ (symbolic value sets), where $l$ is some predetermined natural number. The size of $l$ influences the precision of the analysis and should be chosen carefully to fit the demands. Furthermore, since $\bot$ stands for any value, we can restrict in this respect to a single set containing $\bot$ alone, $SVS_\bot = SVS \cup \{\{\bot\}\}$. [We use $\{\bot\}$ instead of $\bot$ to have the opportunity to enumerate sets from $SVS_\bot$ or to build Cartesian products from them.]


On $SVS_\bot$ we define a meet operator $\sqcap_{SVS} : SVS_\bot \times SVS_\bot \to SVS_\bot$ by

$$S \sqcap_{SVS} T = \begin{cases} S \cup T, & \text{if } S, T \in SVS \text{ and } |S \cup T| \le l; \\ \{\bot\}, & \text{otherwise.} \end{cases}$$

$(SVS_\bot, \sqcap_{SVS})$ is a bounded semi-lattice with bottom element $\{\bot\}$ and top element $\emptyset$. It seems that making the meet operator more precise than here would not be of interest for our analysis.
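Transcribed to code, with BOTTOM standing for the set $\{\bot\}$ and a bound $l$ chosen arbitrarily for illustration, the meet operator might look like this sketch:

    # Bounded meet on symbolic value sets: union up to l elements, else bottom.
    BOTTOM = "BOTTOM"   # represents the set containing only the unknown value
    l = 8               # an assumed bound; the paper leaves l as a parameter

    def meet_svs(S, T):
        if S == BOTTOM or T == BOTTOM:
            return BOTTOM
        U = S | T
        return U if len(U) <= l else BOTTOM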

5.3.4. Approximation of the Values in Registers

What we are now looking for is an approximation of the values that the registers may contain when reaching a certain statement. This approximation must be safe, in the sense that not every value found during the analysis necessarily appears during runtime, but all values that do in fact show up have to be considered in the approximating set. We define

$$regvals: STMT \to [REGS \to SVS_\bot], \quad (r, v) \in regvals(s) \iff \text{register } r \text{ may contain value } v \text{ when read by statement } s,$$

and calculate regvals by the following MDFS:

Data flow information set:9 $RSV = [REGS \to SVS_\bot]$.
Top element: $\emptyset$.
Meet operator $\sqcap_{RSV} : RSV \times RSV \to RSV$,

$$\forall r \in REGS : (f \sqcap_{RSV} g)(r) = f(r) \sqcap_{SVS} g(r)$$

Semantic functions:

$$K_s(D) = defreg(s) \times SVS_\bot,$$
$$G_s(D) = \{r_i \times F(D, g_i, r_{i_1}, \dots, r_{i_k}) : r_i \in defreg(s)\} \quad (\text{abbreviation: } [g_i; r_{i_1}, \dots, r_{i_k}]_i)$$

where the functions $g_i$ depend on the instruction performed, and

$$F(D, g, r_{i_1}, \dots, r_{i_k}) = {\bigsqcap}_{SVS}\, \{\bar g(v_1, \dots, v_k) : (r_{i_j}, v_j) \in D\}$$

9 Actually, the information at each program point is a function from the registers, but we found it convenient to define it as a relation, so as to simplify the definition of the semantic functions.

in the sequel, with

$$\bar g(v_1, \dots, v_k) = \{g(v_1, \dots, v_k)\}$$

Semantic functions have to be defined for any instruction. The most important cases here are:

mov a, ri: $[\underline{a};\,]_i$ with $\underline{a}() = a$,

mov rj, ri: $[id; r_j]_i$ with $id(v) = v$,

init ri: $[unknown;\,]_i$ with $unknown() = R_{i,s} \in SYM$,

ld [mem], ri: $[unknown;\,]_i$,

entry: $[unknown;\,]_i$ for all registers $r_i$,

call: $[unknown;\,]_i$ for a suitable subset of the registers,

add rj, rk, ri: $[add; r_j, r_k]_i$ with

$$add(v, w) = \begin{cases} \bot, & \text{if } v = \bot \text{ or } w = \bot; \\ v \oplus w, & \text{otherwise,} \end{cases}$$

sub rj, rk, ri: $[sub; r_j, r_k]_i$ with

$$sub(v, w) = \begin{cases} \bot, & \text{if } v = \bot \text{ or } w = \bot; \\ v \ominus w, & \text{otherwise,} \end{cases}$$

mul rj, rk, ri: $[mul; r_j, r_k]_i$ with

$$mul(v, w) = \begin{cases} 0, & \text{if } v = 0 \text{ or } w = 0; \\ c \otimes w, & \text{if } v = c \in \mathbb{Z} \text{ and } w \neq \bot; \\ v \otimes c, & \text{if } v \neq \bot \text{ and } w = c \in \mathbb{Z}; \\ \bot, & \text{otherwise,} \end{cases}$$

div rj, rk, ri: $[div; r_j, r_k]_i$ with

$$div(v, w) = \begin{cases} v, & \text{if } w = 1; \\ \ominus v, & \text{if } v \neq \bot \text{ and } w = -1; \\ \bot, & \text{otherwise.} \end{cases}$$

There are variants of these instructions with integer constants used instead of operand registers; these are treated analogously, one of the polynomials then degenerating to the constant additive term. We have $D^{out}_s = D^{in}_s$ if $defreg(s) = \emptyset$. For all other instructions not treated so far we set $[unknown;\,]_i$ for all registers $r_i$ with $r_i \in defreg(s)$. Most of these functions could, with some programming effort, be modeled a bit more precisely, e.g., shift left like mul, shift right like div, exact handling of Boolean operators, etc. We already added this to our work list.

For division, we only handle trivial cases since in most cases we cannot find a useful approximation for the remainder. But even if we could, we would have to multiply by reciprocals with respect to $\mathbb{Z}_{2^L}$, causing much effort in calculation time or storage space. Note also that from $y = 2 \otimes x$ we could not infer $y/2 = x$ in $\mathbb{Z}_{2^L}$; to see that, take $x = 2^{L-1}$, for instance.

The reader might doubt that a symbol propagated in different registers to an add instruction that uses both registers as operands has the same meaning in the polynomials and thus can be added. The feasibility of this operation follows from the fact that initialization points introduce new incarnations of symbols in each iteration for all such registers.
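The value-level helpers for add and mul can be sketched as follows, reusing the hypothetical LinearPoly and BOTTOM from the sketches above (sub is analogous to add):

    # Value-level semantics of add and mul on SV extended with BOTTOM.
    def is_const(v):
        return v != BOTTOM and not v.free()

    def const_of(v):
        return v.c.get("_const", 0)

    def sv_add(v, w):
        return BOTTOM if BOTTOM in (v, w) else v + w

    def sv_mul(v, w):
        if (is_const(v) and const_of(v) == 0) or \
           (is_const(w) and const_of(w) == 0):
            return LinearPoly({})            # 0 times anything is 0
        if is_const(v) and w != BOTTOM:
            return w.scale(const_of(v))      # c (x) w
        if is_const(w) and v != BOTTOM:
            return v.scale(const_of(w))      # v (x) c
        return BOTTOM                        # result would be nonlinear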

Remark 1. The assumption that all calculations in the processor be modulo $2^L$ is essential for addition, subtraction, and other instructions. Should the processor trap on overflow, functions like add must be defined in a much more restrictive way than described to get a safe approximation. For instance, we could set

$$add(v, w) = \begin{cases} v \oplus w, & \text{if } v, w \neq \bot \text{ and } v \oplus w \in Int; \\ w, & \text{if } v = 0; \\ v, & \text{if } w = 0; \\ \bot, & \text{otherwise,} \end{cases}$$

and

$$sub(v, w) = \begin{cases} v \ominus w, & \text{if } v, w \neq \bot \text{ and } v \ominus w \in Int; \\ \bot, & \text{otherwise,} \end{cases}$$

where $Int$ is the set of integers representable in a register. $\bot$ here also models possible overflow situations that could terminate the execution of the instruction with an exception.

5.4. Evaluating Address Expressions

The results of our symbolic value set propagation have to be substituted into the address expressions of all memory instructions to get a safe approximation of the addresses accessed by these instructions. To perform this substitution we have to touch each instruction only once. We assume that the only instructions involving memory are loads, stores, and calls. The set of addresses possibly accessed by an instruction is calculated as follows:

$$addr: STMT \to SVS_\bot, \quad v \in addr(s) \iff v \text{ may be the address of a memory cell accessed in statement } s,$$

$$addr(s) = \begin{cases} F(regvals(s), add(a, \cdot), r_j), & \text{if } [r_j + a] \text{ is the address expression;} \\ F(regvals(s), add, r_j, r_k), & \text{if } [r_j + r_k] \text{ is the address expression;} \\ \{\bot\}, & \text{if } s \text{ is a call instruction;} \\ \emptyset, & \text{otherwise.} \end{cases}$$

Now, we have to distinguish whether an instruction reads from memory or writes to it. Thus, we derive functions defmem and usemem from addr:

$$defmem: STMT \to SVS_\bot, \quad v \in defmem(s) \iff \text{statement } s \text{ possibly writes to address } v,$$

$$defmem(s) = \begin{cases} addr(s), & \text{if } s \text{ is a store or call instruction;} \\ \emptyset, & \text{otherwise,} \end{cases}$$

$$usemem: STMT \to SVS_\bot, \quad v \in usemem(s) \iff \text{statement } s \text{ possibly reads from address } v,$$

$$usemem(s) = \begin{cases} addr(s), & \text{if } s \text{ is a load or call instruction;} \\ \emptyset, & \text{otherwise.} \end{cases}$$

The handling of call instructions here is quite approximative and could possibly be improved by distinguishing certain storage areas with specific bottom symbols.
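Under assumed conventions for representing instructions (kind, base, index, and offset are hypothetical attributes, and regvals(s) is assumed to map each register to its symbolic value set or BOTTOM), addr can be sketched as:

    # Symbolic address set of an instruction: substitute the registers'
    # symbolic value sets into the address expression. The l-bound of
    # Section 5.3.3 is omitted here for brevity.
    def addr(s, regvals):
        if s.kind == "call":
            return BOTTOM                    # may access any address
        if s.kind not in ("ld", "st"):
            return set()                     # no memory access
        base = regvals(s)[s.base]
        if base == BOTTOM:
            return BOTTOM
        if s.index is None:                  # format [%rj + offset]
            off = LinearPoly({"_const": s.offset})
            return {v + off for v in base}
        index = regvals(s)[s.index]          # format [%rj + %rk]
        if index == BOTTOM:
            return BOTTOM
        return {v + w for v in base for w in index}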

5.5. Reaching Definitions and Uses of Memory

We are now ready for setting up reaching definitions and uses of memory. We formulate this pass as a MDFS that propagates a set of possible reaching definitions and a set of possible reaching uses to the successors of a statement. In order to keep the reaching sets as small as possible (and thus the analysis as precise as could be) we check by a special kind of must-alias analysis whether a definition (or a use) from a previous statement $t$ reaching a statement $s$ will not be visible to the successors of $s$. To assure this, $addr(t)$ must completely be covered by $defmem(s)$; we formulate this by a predicate covers. For deciding the predicate covers here or the predicate cut in Section 5.6, we could in principle use a tool like the Omega Library(22) that manipulates linear constraints. However, our method produces very particular constraints that are more efficiently solved by a direct approach. [Note for clarification that calculating must-dependences is out of the scope of this paper. Also, we partly determine and exploit may-aliases in Section 5.6, but not here. And we do not derive the full must-alias relation in this paper. However, our data flow information would permit us to provide all these relations to full extent.] We define functions for reaching definitions and uses,

$$rdmem: STMT \to \mathcal{P}(STMT), \quad t \in rdmem(s) \iff s \text{ could observe a storage contents written by } t,$$
$$rumem: STMT \to \mathcal{P}(STMT), \quad t \in rumem(s) \iff s \text{ could observe a storage contents read by } t$$

and calculate them by the following MDFS:

Data flow information set: $\mathcal{P}(STMT)$.
Top element: $\emptyset$.
Meet operator $\sqcap_{STMT}$: $S \sqcap_{STMT} T = S \cup T$.

Semantic functions for reaching definitions:

$$G_s(D) = \begin{cases} \{s\}, & \text{if } s \text{ is a store or call instruction;} \\ \emptyset, & \text{otherwise,} \end{cases} \qquad K_s(D) = \{t \in D : covers(s, defmem(s), t, defmem(t))\}$$

Semantic functions for reaching uses:

$$G_s(D) = \begin{cases} \{s\}, & \text{if } s \text{ is a load or call instruction;} \\ \emptyset, & \text{otherwise,} \end{cases} \qquad K_s(D) = \{t \in D : covers(s, defmem(s), t, usemem(t))\}$$

Because the empty address set or the unknown address $\{\bot\}$ can never cover any address, and since the unknown address $\{\bot\}$ can never be covered by any address, we find10

$$covers(s, A, t, B) = \begin{cases} false, & \text{if } A = \emptyset \text{ or } A = \{\bot\} \text{ or } B = \{\bot\}; \\ covers2(s, A, t, B), & \text{otherwise} \end{cases}$$

A nonempty address set $A$ can only cover another nonempty11 address set $B$ if each element in $B$ is covered by each element in $A$, thus

$$covers2(s, A, t, B) = \bigwedge_{(a, b) \in A \times B} covers3(s, a, t, b)$$

The arguments $a$ and $b$ of covers3 may contain some free symbols. We introduce a predicate sep that describes the situation where the initialization points of two symbols to be compared do not share any control path. Cases where such symbols appear together in comparisons are spurious; they are due to the abstraction of control flow where the predicates of conditionals are not used in deciding on paths to be taken. Thus

$$covers3(s, a, t, b) = \begin{cases} true, & \text{if } sep(x, y) \text{ for some } x \in free(a),\ y \in free(b); \\ covers4(s, a, t, b \ominus a), & \text{otherwise} \end{cases}$$

where the predicate

$$sep \subseteq SYM \times SYM, \quad (x, y) \in sep \iff \text{the symbols } x \text{ and } y \text{ are not defined on a common path,}$$

can be derived as

$$sep(x, y) \iff ip(x) \neq ip(y) \text{ and } ip(x) \notin reaches(ip(y)) \text{ and } ip(y) \notin reaches(ip(x))$$

from a function reaches that describes which statements reach a statement in the control flow graph,

$$reaches: STMT \to \mathcal{P}(STMT), \quad t \in reaches(s) \iff \text{there is a path from statement } t \text{ to statement } s$$

10 In an implementation of the formulas from this section, as well as from other sections of the paper, we might make careful use of lazy evaluation rules for Boolean operators.
11 The case $B = \emptyset$ cannot appear.

We calculate reaches by the following MDFS:

Data flow information set: $\mathcal{P}(STMT)$.
Top element: $\emptyset$.
Meet operator $\sqcap$: $S \sqcap T = S \cup T$.
Semantic functions: $K_s(D) = \emptyset$, $G_s(D) = \{s\}$.

When comparing two addresses for coverage we have to form the difference between them. This difference is itself a polynomial, and when it contains a free symbol then we might find a valuation of the symbols where this difference is large in magnitude and thus coverage is not assured. Thus, covers4 never has the value true when there are any free symbols in the difference polynomial,

$$covers4(s, a, t, p) = \begin{cases} false, & \text{if } free(p) \neq \emptyset; \\ covers5(s, a, t, p), & \text{otherwise} \end{cases}$$

Moreover, eliminating free symbols from both argument polynomials by assuming equality is only feasible when such a symbol has the same meaning in both contexts. To delete a statement $t$ from a reaching set arriving at statement $s$, we have to be sure that (i) the memory contents addressed in statement $t$ is overwritten every time the control flow passes statement $s$, and that (ii) there is no way to bypass statement $s$ between the use of a symbol $x$ in statement $t$ and its redefinition. The latter is checked with a predicate shields,12

$$shields: STMT \times STMT \times STMT \to BOOL, \quad shields(s, u, t) \iff \text{every path from } s \text{ to } t \text{ contains } u$$

Notice that we use shields only for $t$ of the form $t = ip(x)$. For fixed $t$, we can observe the similarity with postdominance,13 so that a straightforward extension of the computation of postdominance delivers, for any fixed $t$, the set $\{u : shields(s, u, t)\}$. Furthermore, if we compute a tree to represent this relation, similar to the postdominance tree, and adopt the coding of trees suggested by Brandis,(23) then the required amount of memory is of the order $O(|STMT| \times |ip(SYM)|)$.

Finally, we check whether the memory block of $size(s)$ bytes starting at address $a$ completely covers the memory block of $size(t)$ bytes starting at address $b$:14

$$covers5(s, a, t, p) = \begin{cases} (size(t) \le p + size(t) \le size(s)), & \text{if } shields(t, s, ip(x)) \text{ for all } x \in free(a); \\ false, & \text{otherwise,} \end{cases}$$

where the function $size: STMT \to \mathbb{N}$ reflects the number of bytes accessed in memory.

12 We shall say that $u$ shields $s$ with respect to $t$.
13 If $t$ is the final node, then this is exactly the postdominance relation.
14 If the number of bytes read or written by a memory instruction is always the same, the formula $size(t) \le p + size(t) \le size(s)$ in covers5 simplifies to $p = 0$.

5.6. Determination of Value-Based, Memory-Induced Dependences

To derive value-based, memory-induced dependences from reaching definitions (or reaching uses) for memory, we have to check whether the memory cells accessed in one statement may intersect with the memory cells accessed in another statement. The intersection is checked using the cut predicate, calculating a kind of may-alias information. Given this, the data dependences are as follows:

$$t \text{ is flow dependent on } s \iff s \in rdmem(t) \text{ and } cut(s, defmem(s), t, usemem(t)),$$
$$t \text{ is anti-dependent on } s \iff s \in rumem(t) \text{ and } cut(s, usemem(s), t, defmem(t)),$$
$$t \text{ is output dependent on } s \iff s \in rdmem(t) \text{ and } cut(s, defmem(s), t, defmem(t))$$

An unknown address $\{\bot\}$ may intersect with any address, thus

$$cut(s, A, t, B) = \begin{cases} true, & \text{if } A = \{\bot\} \text{ or } B = \{\bot\}; \\ cut2(s, A, t, B), & \text{otherwise} \end{cases}$$

An intersection between sets of addresses can only be excluded when there is definitely no intersection for any pair of addresses, meaning

$$cut2(s, A, t, B) = \bigvee_{(a, b) \in A \times B} cut3(s, a, t, b)$$

If addresses contain symbols that are incompatible, they can never appear together, so

$$cut3(s, a, t, b) = \begin{cases} false, & \text{if } sep(x, y) \text{ for some } x \in free(a),\ y \in free(b); \\ cut4(s, a, t, b), & \text{otherwise} \end{cases}$$


Now, we have to form a difference polynomial for the final intersection test. A common free symbol from the polynomials $a$ and $b$ might take different values in both polynomials. For correctly setting up the difference polynomial we have to substitute a new symbol (for the common one) into one of the polynomials if its value is not fixed. The latter property is checked by the predicate variant.

Since we are interested in disproving iteration-independent dependences, we define the variant predicate with respect to a cycle head. So $variant(x, h)$ should hold whenever we cannot show that the symbol $x$ has a unique meaning during one iteration of the cycle pack with head $h$. [For a complete run of the procedure, we formally take $h$ to be the procedure entry node.] A safe approximation then is

$$variant(x, h) = \forall f \in back(h) : \neg shields(ip(x), f, ip(x))$$

We can improve this approximation by setting $variant(x, h) = false$ in the following special cases:

1. $ip(x) = h$, or

2. the cycle pack is a loop that comprises only reducible code, and it is the innermost loop containing $ip(x)$.15

The substitution process is described by

$$cut4(s, a, t, b) = cut5(size(s), a', size(t), b)$$

with

$$a' = a[dup(x)/x, \text{ for all } x \in free(a) \cap free(b) \text{ such that } variant(x, h)]$$

where $a[dup(x)/x, \dots]$ means that we substitute a new symbol for each conflicting old one, all at the same time. The function dup has to provide unique duplicates for the symbols appearing in the semantic functions without interfering with the latter. Assuming that statements are counted up from 0, an easy way to achieve this would be by $dup: SYM \to SYM$, $R_{i,s} \mapsto R_{i,-s-1}$.

15 A cycle head $h$ is the entry node of a loop if $h$ dominates all $f \in back(h)$. During depth-first search we detect whether the body of the loop has a reducible control flow graph. When checking the loop entries following the reverse postorder, inner loops are scanned after outer loops; thus we can easily determine the nesting of loops, and in particular find for each node the innermost loop it belongs to.

Page 26: Data Dependence Analysis of Assembly Code

The difference polynomial is implicitly formed in

$$cut5(ss, a', st, b) = \left( \left\lceil \frac{1 - ss - c}{\delta} \right\rceil \le \left\lfloor \frac{st - 1 - c}{\delta} \right\rfloor \right)$$

where

$$p = a' \ominus b, \quad c = p(\epsilon), \quad \delta = \gcd\Big(2^L,\ \gcd_{x \in free(p)} p(x)\Big)$$

We refer to the appendix for the derivation of this formula. This implements what can be considered as an extension of the GCD test, the best we can do in the current situation. As shown in the appendix, the test amounts to asking that an integer multiple of $\delta$ be comprised between the numbers $(1 - ss - c)$ and $(st - 1 - c)$; so it corresponds to the classical GCD test, applied to all values in this interval.

We demonstrate our dependence test by analyzing the code from Fig. 4. Consider, for example, the problem of determining whether the store from instruction 7 is anti-dependent on one of the previous loads, say instruction 5.16 We find $5 \in rumem(7)$. Therefore we compute $cut(5, usemem(5), 7, defmem(7))$. We have $usemem(5) = \{8 R_{0,1} + 4\}$, $defmem(7) = \{8 R_{0,1} + 8\}$, and $sep(R_{0,1}, R_{0,1}) = false$. So,

$$cut(5, \{8 R_{0,1}+4\}, 7, \{8 R_{0,1}+8\})$$
$$= cut2(5, 8 R_{0,1}+4, 7, 8 R_{0,1}+8)$$
$$= cut3(5, 8 R_{0,1}+4, 7, 8 R_{0,1}+8)$$
$$= cut4(5, 8 R_{0,1}+4, 7, 8 R_{0,1}+8)$$
$$= cut5(size(5), 8 R'_{0,1}+4, size(7), 8 R_{0,1}+8)$$

where $R'_{0,1}$ is a new symbol. This gives us $p = (8 R'_{0,1}+4) \ominus (8 R_{0,1}+8)$, $c = -4$, and $\delta = 8$. Since we assume that the memory instructions in question access words of 4 bytes each, we get $ss = size(5) = 4$ and $st = size(7) = 4$.

Thus the result is:

$$cut5(4, 8 R'_{0,1}+4, 4, 8 R_{0,1}+8) = \left( \left\lceil \tfrac{1-4-(-4)}{8} \right\rceil \le \left\lfloor \tfrac{4-1-(-4)}{8} \right\rfloor \right) = \left( \left\lceil \tfrac{1}{8} \right\rceil \le \left\lfloor \tfrac{7}{8} \right\rfloor \right) = (1 \le 0) = false$$

so we can conclude absence of dependence.

16 The answer with the other load will be the same.
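The extended GCD test is compact enough to state as code. The following sketch, built on the hypothetical LinearPoly from Section 5.3, uses integer ceiling and floor division to avoid rounding problems and reproduces the example above:

    # Extended GCD test cut5: is some multiple of delta contained in the
    # interval [1 - ss - c, st - 1 - c]? (ss, st: sizes of the blocks.)
    from functools import reduce
    from math import gcd

    def cut5(ss, a, st, b, L=32):
        p = a - b                            # difference polynomial
        c = p.c.get("_const", 0)             # constant term, kept mod 2**L;
        # since delta divides 2**L, this does not change the outcome.
        delta = reduce(gcd, (p.c[x] for x in p.free()), 1 << L)
        lo, hi = 1 - ss - c, st - 1 - c
        return -((-lo) // delta) <= hi // delta   # ceil(lo/d) <= floor(hi/d)

    # The example from the text: delta = 8, and the test yields "false".
    a = LinearPoly({"R'_{0,1}": 8, "_const": 4})
    b = LinearPoly({"R_{0,1}": 8, "_const": 8})
    assert cut5(4, a, 4, b) is False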

6. IMPLEMENTATION AND RESULTS

The method for determining data dependences in assembly code presented in the last sections was implemented for reducible code as a user function in Salto on a Sun SPARCstation 10, with the following simplifications:

$$covers3(s, a, t, b) = covers4(s, a, t, b \ominus a),$$
$$covers5(s, a, t, p) = \begin{cases} false, & \text{if } variant(x, h) \text{ for some } x \in free(a); \\ (size(t) \le p + size(t) \le size(s)), & \text{otherwise,} \end{cases}$$
$$cut3(s, a, t, b) = cut4(s, a, t, b)$$

Presently, only the assembly code for the SPARC instruction set can be analyzed, but an extension to other processors will require minimal technical effort. Results of our analysis can be used by tools built upon Salto.

For evaluation of our method we have taken a closer look at two aspects:

1. Comparison of the number of data dependences using our method against the method implemented in Salto; this shows the difference between address-based dependence analysis and value-based dependence analysis concerning register accesses.

2. Comparison of the number of data dependences for memory accesses using address-based dependence analysis versus value-based dependence analysis.

As a sample we chose 160 procedures out of the sixth public release of the Independent JPEG Group's free JPEG software, a package for compression and decompression of JPEG images. We distinguish between the following four levels of precision in the analysis:

Level 1. Address-based dependences between register accesses. Memory is modeled as one cell, i.e., every pair of memory accesses is assumed to introduce a data dependence.

Level 2. Value-based dependence analysis for register accesses; memory is modeled as one cell.

Level 3. Value-based dependence analysis for register accesses, and address-based dependence analysis for memory accesses.

Level 4. Value-based dependence analysis for register accesses, value-based dependence analysis for memory accesses.


Salto(17) in principle can be classed at level 1, although it is sometimes less precise than our level 1 analysis because it does not consider control flow, i.e., even instructions never appearing on a common control path may be considered as data dependent in Salto. Level 2 precision is common with today's instruction schedulers, e.g., the one in gcc(1) or the one used by Larus et al.(10) Current systems that do some kind of value propagation (we will have a closer look at other techniques for value propagation in Section 7) determine only address-based dependences and are thus classed at level 3. Apart from our own method, there seems to be no other method supporting level 4 precision.

We instrumented our implementation to report dependences on all four levels. Level 4 is attained by the approach as described in Section 5. For level 3, we also use the approach from Section 5, but with the restriction that the predicate covers always returns false. To demonstrate the improvement from level 3 to level 4, we summarize in Table I the results for those 37 procedures from the JPEG package where we disproved more false dependences on level 4. [Note: These figures depend on the chosen target machine and in particular might vary with the amount of spill code produced by a compiler.]

Complementary to this quantitative evaluation, we attempt a general assessment of our method in the sequel. Therefore we give a characterization of the types of dependences that we can break. [We formulate this with respect to loops, not cycle packs, to stay compatible with the standard literature.] As a principal limit of the current state of our method, where we don't use induction variables or similar techniques yet, loop-carried dependences of a store instruction on itself can never be broken. However, loop-carried dependences between different statements are broken by our cut predicate if the alignment of the memory blocks accessed in these statements inhibits their intersection. We illustrate this claim by the following example.

Assume a source program working on interleaved partitions of an array, as in red-black solving for partial differential equations. A prototype for such problems might be the C sample program reproduced in Fig. 6a. From this source code, a compiler could produce some assembly code principally determined by the scheme sketched in Fig. 4. For instance, the full assembly code shown in Fig. 6b was produced by the gcc compiler, version 2.8.1, using optimization level 1. Applying the generalized GCD test from Section 5.6 to the results of our symbolic value propagation, we have proven that the store instruction is always independent of any of the load instructions. This means that we can apply software pipelining to this loop, provided we respect the self-dependence of the store instruction that we were not able to disprove with our method.


Table I. Sum of True, Anti-, and Output Dependences Found on Four Levels of Precision(a)

                          lines of   level 1      level 2      level 3      level 4     improved (%)
  procedure name          code       reg    mem   reg    mem   reg    mem   reg    mem   reg    mem

  keymatch                   59       643     81   149     81   149     58   149     47   77.0   19.0
  test3function              15        24     29    15     29    15     16    15     13   38.0   19.0
  is_shifting_signed         33       178     38    87     38    87     31    87     22   51.0   29.0
  jpeg_CreateCompress       126      4273   1945   423   1945   423   1664   423   1619   90.0    3.0
  jpeg_suppress_tables       74      1127    396   143    396   143    229   143    184   87.0   20.0
  jpeg_finish_compress      144     10432   2333  1121   2333  1121   2210  1121   2197   89.0    1.0
  emit_byte                  42       433    214   119    214   119    189   119    184   73.0    3.0
  emit_dqt                  125      4794   1097   575   1097   575    771   575    726   88.0    6.0
  emit_dht                  134      5219   1461   589   1461   589    980   589    870   89.0   11.0
  emit_sof                  100      6389   1282   661   1282   661   1087   661   1077   90.0    1.0
  emit_sos                  100      5252   1285   574   1285   574    873   574    840   89.0    4.0
  write_any_marker           41       561    175   184    175   184    110   184    106   67.0    4.0
  write_frame_header        142      4309   1368   679   1368   679    870   679    744   84.0   14.0
  write_scan_header          86      3656    626   934    626   934    486   934    459   74.0    6.0
  write_tables_only          83      2495    390   716    390   716    324   716    267   71.0   18.0
  jpeg_abort                 38       268     84    93     84    93     67    93     63   65.0    6.0
  jpeg_CreateDecompress     124      4878   1972   507   1972   507   1716   507   1659   90.0    3.0
  jpeg_start_decompress     135      4097    902   674    902   674    860   674    856   84.0    1.0
  post_process_2pass        111      2583   1385   278   1385   278    907   278    878   89.0    3.0
  jpeg_read_coefficients    113      3783    897   538    897   538    853   538    851   86.0    1.0
  select_file_name          104      5631   1146   473   1146   473    714   473    644   92.0   10.0
  jround_up                  20        80     29    30     29    30     20    30     15   62.0   25.0
  jcopy_sample_rows          46       354    197   115    197   115     89   115     64   68.0   28.0
  read_1_byte                48       653     84   186     84   186     70   186     67   72.0    4.0
  read_2_bytes               93      3115    360   555    360   555    297   555    285   82.0    4.0
  next_marker                42       567    137   305    137   305    112   305     98   46.0   12.0
  first_marker               84      1989    259   360    259   360    197   360    187   82.0    5.0
  skip_variable              33       533     83   258     83   258     83   258     73   52.0   12.0
  process_COM               107      6901    979  1147    979  1147    697  1147    592   83.0   15.0
  process_SOFn               75      4545    729   670    729   670    601   670    598   85.0    1.0
  scan_JPEG_header           34       804     82   306     82   306     78   306     77   62.0    1.0
  read_byte                  43       415    105   129    105   129    102   129     97   69.0    5.0
  read_colormap              67      2221    668   305    668   305    583   305    568   86.0    3.0
  read_non_rle_pixel         40       368     93   125     93   125     84   125     83   66.0    1.0
  read_rle_pixel             80       976    289   268    289   268    280   268    279   73.0    1.0
  flush_packet               44       468    187   131    187   131    187   131    182   72.0    3.0
  start_output_tga          215     12870   3272   974   3272   974   2937   974   2876   92.0    2.0

(a) The results are divided into register-induced and memory-induced dependences. The two rightmost columns show the improvement of level 4 analysis on level 3 analysis, i.e., of a value-based memory dependence analysis on an address-based memory dependence analysis.


Fig. 6. Disproving loop-carried dependences: (a) sample C code; (b) SPARC assembly program.

In fact, there is no memory dependence at all in this example, but we would only be able to prove this supplementary fact with the help of induction variable detection.
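Since Fig. 6a itself is not reproduced in this text, the following C loop is our own illustrative reconstruction of the kind of program meant (hypothetical, but matching the address shapes 8 · R + 4 and 8 · R + 8 from the example in Section 5.6): the store always targets odd-indexed elements while the loads only read even-indexed ones.

    /* Red-black style relaxation on interleaved partitions of an array:
       odd cells are updated from their even neighbors, so with 4-byte
       ints the store address has the form 8j + 4 while the loads use
       8j and 8j + 8. */
    void relax_odd(int *a, int n)
    {
        for (int i = 1; i < n - 1; i += 2)
            a[i] = (a[i - 1] + a[i + 1]) / 2;
    }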

Dependences within one loop iteration can be broken in many cases. Control flow inside the loop body as well as in pre-loop code might inhibit some breakings; in particular, we cannot always achieve the same precision as with full linear relations over register contents. Also, breaking dependences inside any basic block or between basic blocks might be inhibited for the same reasons. We can sometimes break dependences between a loop nest and some past-loop code, in rare cases also between a loop nest and some pre-loop code, or even between different loop nests. Except for the situation of a statement depending on itself, which can only happen in the context of output dependences, we see no principal differences in the power of our dependence testing with respect to the three kinds of dependences (flow dependences, anti-dependences, and output dependences).

7. RELATED WORK

Work in the field of memory reference disambiguation for assembly language is beginning to emerge as a topic of research. Ellis(24) presented a method to derive symbolic expressions for memory addresses by chasing back all reaching definitions of a symbolic register; the expression is simplified using elementary arithmetic, and two expressions are compared


using, e.g., the GCD test. The method is implemented in the Bulldog compiler, but it works on an intermediate level close to high-level language. In order to deal with loops, dummy assignments are inserted for loop counters, very similar to our initialization points.

Note that Ellis also proposed (see Chap. 5) what looks like a precursor of φ-functions, before the term was coined.(25) He suggested to stop chasing back reaching definitions at a merge point, returning instead a symbol representing a set of reaching definitions. Then it would be possible in some cases to reduce the number of terms in a symbolic value set to one element, allowing a precise answer. This might be compatible with our approach, where we would insert φ-functions, and chase back beyond the merge point only when a nonsatisfactory answer is obtained.

Other authors inspired by Ellis were Lowney et al.,(26) Böckle,(27) and Moon and Ebcioğlu.(28) The latter approach calculates alias information for assembly code and is implemented in the Chameleon compiler.(29) First, a procedure is transformed into DFG form (Dependence Flow Graph(30)), which provides integrated control and data flow information. For gathering possible register values the same technique as in the Bulldog compiler is used. If a register has multiple definitions, the algorithm described by Moon and Ebcioğlu(28) can chase all reaching definitions, whereas the concrete implementation in the Chameleon compiler apparently does not support this. Compared to our method, Chameleon preserves more pre-loop relations by applying the Banerjee inequalities(31) to the ranges of induction variables besides the GCD test(16) when comparing memory addresses.

Debray et al.(32) present an approach close to ours. They use address descriptors to represent abstract addresses, i.e., addresses containing symbolic registers. An address descriptor is a pair (I, M) where I is an instruction and M is a set of mod-k residues. M denotes a set of offsets relative to the register defined in instruction I. Note that an address descriptor only depends on one symbolic register. So the meaning of an address descriptor (I, M) is the set of addresses w + x + k · i, where w ranges over the values that may be computed into a register by instruction I, x ∈ M, and i is any integer. The set of address descriptors is made into a bounded semi-lattice, so that a data flow system can be used to propagate values through the control flow graph. [Note: In the tests, k = 64 had been used.] However, this leads to an approximation of address representation that makes it impossible to derive must-alias information. The second drawback is that definitions of the same register on different control flow paths are not joined in a set, but mapped to ⊥. Comparing address descriptors can be reduced to a comparison of mod-k sets, which can be done very efficiently thanks to a bit-vector representation. They illustrate the use of their method by showing the elimination of spurious loads, and perform interprocedural analysis.
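A minimal sketch of the resulting alias test, under assumptions of ours (k = 64 as in their experiments, access widths ignored, the semi-lattice machinery omitted, and all names hypothetical): each mod-64 residue set fits in one 64-bit word, so comparing two descriptors anchored at the same defining instruction is a single bitwise intersection.

    #include <stdint.h>

    typedef struct {
        int      def_instr; /* instruction I defining the base value    */
        uint64_t residues;  /* bit i set <=> offset i (mod 64) possible */
    } addr_desc;

    static int may_alias(addr_desc x, addr_desc y)
    {
        if (x.def_instr != y.def_instr)
            return 1; /* different bases: no information, assume alias */
        return (x.residues & y.residues) != 0ULL; /* residue sets meet */
    }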


Fig. 7. Example demonstrating Bodik's method.

We observed that entry points to cycle packs are precisely the places where a widening operator would be placed, which is an alternative way of obtaining convergence in finite time, provided one is able to define the widening operator.(33)

Finally, let us mention the work of Bodik and Anik,(34) which, although not directly targeting the analysis of assembly code, might be a useful complement to our analysis. The idea is to start from the expressions of interest (for us, for instance, this could be the address expressions), and to visit the control flow graph backwards. At each node, the variable appearing on the left-hand side is substituted back into the expression(s) of interest. Take, for instance, the example from Fig. 7. Starting from instruction 7 and its operand (r2 − r1), we visit the nodes in postorder. Node 6 defines a component of (r2 − r1), namely r1. Back-substituting, we find that the value of the expression is (r2 − (r2 − 1)) = 1. The technique will be able to discover that r3 at instruction 7 can only take one value, namely 1.
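Since all values in this small example stay linear in r2, the back-substitution can be mimicked by computing with pairs coef · r2 + cst. The following C sketch (our illustration; the instruction numbers follow the text, as Fig. 7 is not reproduced) reduces (r2 − r1) to the constant 1:

    #include <assert.h>

    typedef struct { long coef, cst; } lin; /* value = coef * r2 + cst */

    static lin lin_sub(lin a, lin b)
    {
        lin r = { a.coef - b.coef, a.cst - b.cst };
        return r;
    }

    int main(void)
    {
        lin r2 = { 1, 0 };                   /* r2 itself              */
        lin r1 = lin_sub(r2, (lin){ 0, 1 }); /* node 6: r1 = r2 - 1    */
        lin r3 = lin_sub(r2, r1);            /* instr. 7: r3 = r2 - r1 */
        assert(r3.coef == 0 && r3.cst == 1); /* r3 is the constant 1   */
        return 0;
    }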

There have been prior publications by us on the same topic, in particular Amme et al.(35) and Braun et al.(36) The paper by Amme et al.(35) described the project at an earlier stage. Compared to this preliminary version, we here use a generalized and more precisely described programming model; in particular, we now handle arbitrary cycles in the control flow graph, and memory accesses of any, possibly mixed, width. We improved the placement rules for initialization points, and the precision of our reaching definitions/uses calculation and dependence test. In our more recent publication,(36) the emphasis is on the implementation strategy of our method within the Salto tool, giving rise to an object-oriented system for monotone data flow analysis, called jSalto.

8. CONCLUSIONS

In this paper we presented a new method to detect data dependences in assembly code. It works in two steps: First we perform a symbolic value set propagation using a monotone data flow system. Then we compute reaching definitions and uses for memory access, and derive value-based


data dependences. Note that other approaches only calculate address-based dependences. For comparing memory references we use a modification of the GCD test.

Software pipelining will be one major application of the present work in the near future. This family of techniques overlaps the execution of different loop iterations and therefore can profit from a precise dependence analysis. Although the dependence information our current implementation is able to deliver can already be used by software pipelining tools, we must strive for producing dependence distances. Unrolling several iterations of the loop before performing a dependence analysis could be a preliminary technique to calculate dependence distances. However, we plan to tackle the problem directly by implementing passes that calculate loop invariants and discover induction variables. [Note: Sharing of induction variables will also diminish the number of artificial initialization points and thus allow more pre-loop information to be exploited inside the loop body.] A further improvement will come from deriving bounds on the contents of registers, e.g., by including the predicates of conditionals. Then coupling with known dependence tests, such as the Banerjee test(31) or the Omega test,(22) can be considered.

Another project is to propagate symbolic values through memory cells. Assume a register is loaded from memory. We can determine a safe approximation S of all statements that might have contributed to the loaded value. Provided some alignment conditions hold, instead of representing the loaded value by a natural initialization point we might take the meet over all values stored by statements from S. A complementary technique could find out whether different statements load the same value from a memory cell; the value in the target registers should then be represented by a common symbol. We hope to improve the precision of our dependence analysis by these propagation techniques. Solving such problems is also closely related to the elimination of redundant memory instructions, as for instance appearing in spill code.

Finally, extending our method to interprocedural analysis would lead to a more precise dependence analysis. Presently we have to assume that the contents of almost all registers and all memory cells may have changed after the evaluation of a procedure call. We plan to investigate more in depth the relationship of our work to abstract interpretation.(37) We also consider exploiting assertions passed from high-level analysis to low-level analysis.(38)

APPENDIX: DERIVATION OF FORMULA cut5

See Section 5.6 for the use of this predicate and notations. Note that δ is always positive, as it is the GCD of several quantities including 2^L.


    cut5(ss, a', st, b)
      ⟺ ∃ α, β ∈ Z, h, k ∈ N₀, h < ss, k < st:
            a' + h + α · 2^L = b + k + β · 2^L
      ⟺ ∃ α, β ∈ Z, h, k ∈ N₀, h < ss, k < st:
            p + (α − β) · 2^L = k − h,   with p = a' − b
      ⟺ ∃ α, β ∈ Z, h, k ∈ N₀, h < ss, k < st:
            (p − c) + (α − β) · 2^L = k − h − c,   with c = p(⊥)
      ⟺ ∃ γ ∈ Z, h, k ∈ N₀, h < ss, k < st:
            γ · δ = k − h − c,   with δ = gcd(2^L, gcd_{x ∈ free(p)} p(x))
      ⟺ ∃ γ ∈ Z: −ss + 1 ≤ γ · δ + c ≤ st − 1

Hence the conclusion:

    cut5(ss, a', st, b) = (⌈(1 − ss − c)/δ⌉ ≤ ⌊(st − 1 − c)/δ⌋)

Even in view of this proof it might be surprising that the calculation of δ has to involve 2^L. The reason is that address calculations might wrap around 2^L any number of times; i.e., strictly speaking we are not calculating in Z but in the system of residues modulo 2^L. This phenomenon is not restricted to assembly language programs. You might observe it, for instance, with the sample C program from Fig. 8, provided addresses are 32 bits wide. You will notice that forming δ without considering 2^L as a constituent of the GCD would miss the anti-dependence of the first loop iteration on the pre-loop code.
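As a hypothetical numeric instance of this effect (ours, not the program of Fig. 8): let L = 32, ss = st = 4, and p = (2^31 + 4) · x − 4, so c = −4 and the test interval is [1 − ss − c, st − 1 − c] = [1, 7]. Taking the GCD of the coefficients alone would give δ = 2^31 + 4, which has no multiple in [1, 7], wrongly suggesting independence. Including 2^L yields δ = gcd(2^32, 2^31 + 4) = 4, whose multiple 4 lies in [1, 7]; and indeed at x = 2^29 + 1 we have (2^31 + 4) · x ≡ 4 (mod 2^32), so the two accesses collide after wrap-around.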

Fig. 8. Sample program showing the necessity to consider wrap-around during address calculation.


ACKNOWLEDGMENTS

This work has been supported in part through a travelling grant from the French Ministry of Foreign Affairs (MAE) and the German Academic Exchange Service (DAAD) under the PROCOPE program. We are also indebted to our referees for their thorough comments and their helpful advice.

REFERENCES

1. M. D. Tiemann, The GNU instruction scheduler, Technical Report CS 343, Free Software Foundation, Cambridge, Massachusetts (June 1989).

2. S. Davidson, D. Landskov, B. D. Shriver, and P. W. Mallet, Some experiments in local microcode compaction for horizontal machines, IEEE Trans. Computers 30(7):460–477 (July 1981).

3. J. A. Fisher, Trace scheduling: A technique for global microcode compaction, IEEE Trans. Computers 30(7):478–490 (July 1981).

4. A. Nicolau, Percolation scheduling: A parallel compilation technique, Technical Report TR 85-678, Cornell University, Department of Computer Science (May 1985).

5. B. R. Rau and J. A. Fisher, Instruction-level parallel processing: History, overview, and perspective, J. Supercomputing 7(1–2):9–50 (May 1993).

6. R. M. Stallman, Using and porting the GNU CC, Technical Report, Free Software Foundation, Cambridge, Massachusetts (January 1989).

7. D. W. Wall, Systems for late code modification, In R. Giegerich and S. L. Graham (eds.), Code Generation – Concepts, Tools, Techniques, Workshops in Computing, Springer-Verlag, pp. 275–293 (1992).

8. A. Srivastava and D. W. Wall, A practical system for intermodule code optimization at link-time, J. Progr. Lang. 1(1):1–18 (June 1989).

9. J. A. Fisher, Walk-time techniques: Catalyst for architectural change, IEEE Computer 30(9):40–42 (September 1997).

10. E. Schnarr and J. R. Larus, Instruction scheduling and executable editing, Proc. 29th Ann. IEEE/ACM Int'l. Symp. Microarchitecture (MICRO-29), Paris, pp. 288–297 (December 1996).

11. R. L. Sites, A. Chernoff, M. B. Kirk, M. P. Marks, and S. G. Robinson, Binary translation, Commun. ACM 36(2):69–81 (February 1993).

12. P. Feautrier, Array expansion, Proc. Second Int'l. Conf. Supercomputing, ACM Press (July 1988).

13. W. Pugh and D. Wonnacott, An exact method for analysis of value-based array data dependences, In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua (eds.), Proc. Sixth Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Springer-Verlag, Portland, Oregon, Vol. 768, pp. 546–566 (1993).

14. W. Landi and B. G. Ryder, Pointer-induced aliasing: A problem classification, Conf. Record 18th Ann. ACM Symp. Principles Progr. Lang., Orlando, Florida, pp. 93–103 (January 1991).

15. D. W. Wall, Limits of instruction level parallelism, ACM SIGPLAN Notices 26(4):176–188 (April 1991).

16. S. S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, San Francisco, California (1997).


17. E. Rohou, F. Bodin, and A. Seznec, Salto: System for assembly-language transformation and optimization, In M. Gerndt (ed.), Proc. Sixth Workshop Compilers for Parallel Computers, Konferenzen des Forschungszentrums Jülich, Forschungszentrum Jülich, Aachen, Vol. 21, pp. 261–272 (December 1996).

18. J. R. Larus and P. N. Hilfinger, Detecting conflicts between structure accesses, ACM SIGPLAN Notices 23(7):21–34 (July 1988).

19. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley Publishing Company, Reading, Massachusetts (1988).

20. J. B. Kam and J. D. Ullman, Monotone data flow analysis frameworks, Acta Informatica 7:305–317 (1977).

21. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley Publishing Company, Reading, Massachusetts (1974).

22. W. Pugh, The Omega test: A fast and practical integer programming algorithm for dependence analysis, Commun. ACM 35(8):102–114 (August 1992).

23. M. M. Brandis, Optimizing compilers for structured programming languages, Ph.D. dissertation, Institute for Computer Systems, ETH Zürich (1995).

24. J. R. Ellis, Bulldog: A Compiler for VLIW Architectures, ACM Doctoral Dissertation Awards, MIT Press, Cambridge, Massachusetts (1985).

25. R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, An efficient method of computing static single assignment form, Conf. Record 16th Ann. ACM Symp. Principles Progr. Lang., Austin, Texas, pp. 25–35 (January 1989).

26. P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O'Donnell, and J. C. Ruttenberg, The Multiflow trace scheduling compiler, J. Supercomputing 7:51–142 (1993).

27. G. Böckle, Exploitation of Fine-Grain Parallelism, Lecture Notes in Computer Science, Vol. 942, Springer-Verlag, Berlin (1995).

28. S.-M. Moon and K. Ebcioğlu, A study on the number of memory ports in multiple instruction issue machines, Proc. 26th Ann. Int'l. Symp. Microarchitecture (MICRO-26), Austin, Texas, pp. 49–58 (December 1993).

29. M. Moudgill, J. H. Moreno, K. Ebcioğlu, E. Altman, S. K. Chen, and A. Polyak, Compiler/architecture interaction in a tree-based VLIW processor, Workshop on Interaction between Compilers and Computer Architecture '97 (in conjunction with HPCA-3), San Antonio, Texas (February 1997).

30. K. Pingali, M. Beck, R. Johnson, M. Moudgill, and P. Stodghill, Dependence flow graphs: An algebraic approach to program dependencies, Conf. Record 18th Ann. ACM Symp. Principles Progr. Lang., ACM Press, Orlando, Florida (January 1991).

31. U. Banerjee, Dependence Analysis for Supercomputing, Kluwer Academic Publishers, Boston, Massachusetts (1988).

32. S. Debray, R. Muth, and M. Weippert, Alias analysis of executable code, Proc. 25th ACM SIGPLAN-SIGACT Symp. Principles Progr. Lang. (POPL '98), ACM Press, San Diego, California, pp. 12–24 (January 1998).

33. P. Cousot and R. Cousot, Abstract interpretation and applications to logic programs, J. Logic Progr. 13(2–3):103–180 (July 1992).

34. R. Bodik and S. Anik, Path-sensitive value-flow analysis, Proc. 25th ACM SIGPLAN-SIGACT Symp. Principles Progr. Lang. (POPL '98), ACM Press, San Diego, California, pp. 237–251 (January 1998).

35. W. Amme, P. Braun, F. Thomasset, and E. Zehendner, Data dependence analysis of assembly code, Proc. Int'l. Conf. Parallel Architectures and Compilation Techniques (PACT '98), Paris, pp. 340–347 (October 1998).


36. P. Braun, W. Amme, F. Thomasset, and E. Zehendner, A data flow framework for analyzing assembly code, Proc. 8th Int'l. Workshop on Compilers for Parallel Computers (CPC 2000), Aussois, France, pp. 163–172 (January 2000).

37. P. Cousot, Abstract interpretation, ACM Computing Surveys 28(2):324–328 (June 1996).

38. S. Novack and A. Nicolau, A hierarchical approach to instruction level parallelization, IJPP 23(1):35–62 (February 1995).
