PERFORMANCE OPTIMIZATION OF A CLASS OF LOOPS IMPLEMENTING MULTI-DIMENSIONAL INTEGRALS
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Chi Chung Lam, B.S.C.I.S., M.S.
* * * * *
The Ohio State University
1999
Dissertation Committee:
Professor Ponnuswamy Sadayappan, Adviser
Professor Dhabaleswar K. Panda
Professor Rephael Wenger
Professor Gerald Baumgartner
Approved by
Adviser
Department of Computer and Information Science
UMI Number: 9941367
UMI Microform 9941367 Copyright 1999, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized copying under Title 17, United States Code.
UMI, 300 North Zeeb Road, Ann Arbor, MI 48103
ABSTRACT
Multi-dimensional summations, or discretized integrals, involving products of several arrays arise in scientific computing, e.g. in calculations that model electronic properties of semiconductors and metals. This thesis addresses the performance optimization of a class of loops that implement such multi-dimensional summations. The optimization measures considered are arithmetic operation count, communication cost, and memory usage.

The goal of the operation minimization problem is to seek an equivalent sequence of nested loops that computes a given summation using a minimum number of arithmetic operations. The problem is proved to be NP-complete and an efficient pruning search algorithm is developed for finding an optimal solution.

Due to the potentially large sizes of intermediate arrays in the synthesized optimal solution, it is imperative to reduce the memory usage by loop fusion and loop reordering transformations. We analyze the relationship between loop fusion and memory usage and present algorithms for finding loop fusion configurations that minimize memory usage under static and dynamic memory allocation models.

In evaluating the sums in a multi-processor environment, the partitioning of the arrays among processors determines the inter-processor communication overhead. The processors are modeled as a logical multi-dimensional processor grid, with each array to be distributed or replicated along one or more processor dimensions. A dynamic programming algorithm is developed to determine an optimal partitioning of data and operations among processors that minimizes the communication and computational costs. We also describe two approaches for determining the appropriate loop fusions and array distributions that minimize communication cost without exceeding a given memory limit.

After initially developing the solutions to the various optimization problems in the context of dense arrays, we enhance them to address the practically significant issues of sparsity, use of fast Fourier transforms, and utilization of common sub-expressions.
Dedicated to my wife
ACKNOWLEDGMENTS

I wish to thank my advisor, P. Sadayappan, for intellectual support, encouragement, and enthusiasm which made this thesis possible, and for his patience in correcting both my stylistic and scientific errors.

I thank Rephael Wenger, Gerald Baumgartner, and Daniel Cociorva for stimulating discussions on various aspects of this thesis.

This research was supported in part by a grant from the National Science Foundation.
VITA
February 22, 1962 .................... Born - Hong Kong
1995 .................................. B.S. Computer and Information Science, The Ohio State University
1998 .................................. M.S. Computer and Information Science, The Ohio State University
1995-present .......................... Graduate Fellow, The Ohio State University
PUBLICATIONS
Research Publications
Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. "Optimization of memory usage and communication requirements for a class of loops implementing multi-dimensional integrals". In Languages and Compilers for Parallel Computing, San Diego, August 1999.

Chi-Chung Lam, P. Sadayappan, Daniel Cociorva, Mebarek Alouani, and John Wilkins. "Performance optimization of a class of loops involving sums of products of sparse arrays". In Ninth SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, TX, March 1999.

Chi-Chung Lam and Wu-Chi Feng. "Approximating Cumulative Bandwidth Requirements for The Delivery of Stored Video". In Interworking, Ottawa, Canada, July 1998.

Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "On optimizing a class of multi-dimensional loops with reductions for parallel execution". Parallel Processing Letters, 7(2):157-168, 1997.
Chi-Chung Lam, C.-H. Huang, and P. Sadayappan. "Optimal algorithms for all-to-all personalized communication on rings and two-dimensional tori". Journal of Parallel and Distributed Computing, 43:3-13, 1997.

Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimization of a class of multi-dimensional integrals on parallel machines". In Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.

Chi-Chung Lam. "An Efficient Distributed Channel Allocation Algorithm Based on Dynamic Channel Boundaries". In International Conference on Network Protocols, Columbus, Ohio, October 1996.

Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimal reordering and mapping of a class of nested-loops for parallel execution". In Languages and Compilers for Parallel Computing, San Jose, August 1996.
FIELDS OF STUDY
Major Field: Computer and Information Science
Studies in:
High Performance Computing: Prof. P. Sadayappan
Data Mining: Prof. Renée Miller
Networking: Prof. Raj Jain and Prof. Wu-Chi Feng
TABLE OF CONTENTS
Page
Abstract ............................................................ ii
Dedication .......................................................... iv
Acknowledgments ..................................................... v
Vita ................................................................ vi
List of Tables ...................................................... xi
List of Figures ..................................................... xii

Chapters:

1. Introduction ..................................................... 1

   1.1 Motivation ................................................... 2
   1.2 Form of Multi-Dimensional Summations ........................ 4
   1.3 Methodology .................................................. 5
   1.4 An Example ................................................... 8
   1.5 Related Work ................................................. 14
   1.6 Organization of this Dissertation ........................... 18

2. Operation Minimization ........................................... 20

   2.1 Problem Description .......................................... 21
   2.2 Formalization of the Optimization Problem ................... 23
   2.3 Expression Tree Representation ............................... 26
   2.4 NP-Completeness .............................................. 27
   2.5 A Pruning Search Procedure .................................. 32
   2.6 Common Sub-Expressions ....................................... 36
   2.7 Sparse Arrays ................................................ 38
   2.8 Fast Fourier Transform ....................................... 41
   2.9 An Example ................................................... 43

3. Memory Usage Minimization ........................................ 45

   3.1 Introduction ................................................. 46
   3.2 Preliminaries ................................................ 52
   3.3 Static Memory Allocation ..................................... 56
   3.4 Memory-Optimal Evaluation Order of Unfused Expression Trees . 64
       3.4.1 Problem Statement ...................................... 65
       3.4.2 An Efficient Algorithm ................................. 69
       3.4.3 Correctness of the Algorithm ........................... 74
   3.5 Dynamic Memory Allocation .................................... 81
   3.6 Common Sub-Expressions ....................................... 88
   3.7 Sparse Arrays ................................................ 94
   3.8 Fast Fourier Transform ....................................... 97
   3.9 Further Reduction in Memory Usage ........................... 98
   3.10 An Example .................................................. 100

4. Communication Minimization ....................................... 105

   4.1 Preliminaries ................................................ 106
   4.2 Application to Matrix Multiplication ........................ 109
   4.3 A Dynamic Programming Algorithm ............................. 112
   4.4 An Example ................................................... 114
   4.5 Common Sub-Expressions ....................................... 115
   4.6 Sparse Arrays ................................................ 117
   4.7 Fast Fourier Transform ....................................... 118
   4.8 Communication Minimization with Memory Constraint ........... 118
       4.8.1 Preliminaries .......................................... 119
       4.8.2 Two Approaches ......................................... 122

5. Conclusions ...................................................... 127

   5.1 Research Topics for Further Pursuit ......................... 129
       5.1.1 Generalization of The Class of Nested-Loop Computations Handled 129
       5.1.2 Optimization of Cache Performance ...................... 130
       5.1.3 Optimization of Disk Access Cost ....................... 130
       5.1.4 Development of an Automatic Code Generator ............. 132

Bibliography ........................................................ 134
LIST OF TABLES
Table Page
1.1 Variables in an example physics computation. ................... 3

3.1 Solution sets for the subtrees in the example. ................. 61

3.2 Memory usage of three different traversals of the expression tree in Figure 3.8. ................... 67

3.3 The seqsets for the fusion graph in Figure 3.12(a). ............ 87

4.1 Optimal array distributions for an example formula sequence. ... 115
LIST OF FIGURES
Figure Page
1.1 Two ways to compute the same multi-dimensional summation. ..... 9

1.2 Three loop fusion configurations. .............................. 11

2.1 An example expression tree. .................................... 26

2.2 Different expression trees for a multiplication decision problem instance. ... 30

2.3 Expression tree transformations for the first pruning rule. .... 33

2.4 The expression tree transformation for the second pruning rule. ... 35

2.5 Sparsity entries and sparsity graphs of arrays. ................ 40

3.1 An example multi-dimensional summation and two representations of a computation. ... 47
3.2 Three loop fusion configurations for the expression tree in Figure 3.1. 48
3.3 Algorithm for static memory allocation............................................................ 58
3.4 Algorithm for static memory allocation, (cont.) ........................................ 59
3.5 Algorithm for static memory allocation, (cont.) ........................................ 60
3.6 Algorithm for static memory allocation, (cont.) ........................................ 61
3.7 An optimal solution for the example. .......................... 63
3.8 An example unfused expression tree................................................................. 65
3.9 Procedure for finding a memory-optimal traversal of an expression tree. ... 71

3.10 Optimal traversals for the subtrees in the expression tree in Figure 3.8. ... 73

3.11 Memory usage comparison of two traversals in Lemma 3. ........ 78

3.12 A fusion graph with equal fusions and its two evaluation orders. ... 83

3.13 Algorithm for dynamic memory allocation. ..................... 84

3.14 Algorithm for dynamic memory allocation (cont.) .............. 85

3.15 An example multi-dimensional summation with common sub-expressions and representations of a computation. ... 90

3.16 Examples of illegal fusion graphs for a DAG. ................. 91
3.18 An example of legal loop fusions for sparse arrays........................................ 94
3.19 Illegal fusion graphs due to representations of sparse arrays.................... 95
3.20 Fused size of sparse arrays..................................................................................... 96
3.21 Loop fusions for an FFT node.............................................................................. 97
3.22 The DAG representation of an example formula sequence........................ 101
3.23 Optimal loop fusions for the example formula sequence............................. 104
4.1 An example expression tree................................................................................... 106
4.2 Expression tree representation of matrix m ultiplication............................. 109
4.3 The expression tree in Figure 3.1(c). ......................... 120
CHAPTER 1

INTRODUCTION
This thesis addresses the optimization of a class of loop computations that arise in the implementations of multi-dimensional integrals, or summations, of the product of several arrays. Such integral calculations arise, for example, in the computation of electronic properties of semiconductors and metals [6, 24, 54]. The objective is to minimize the execution time of such computations on a parallel computer with memory constraints. In addition to the performance optimization issues pertaining to inter-processor communication, there is the opportunity to apply algebraic transformations using the properties of commutativity, associativity, and distributivity, to minimize the total number of arithmetic operations. Since the input and intermediate arrays are often too large to fit into the available memory, loop fusion is necessary to reduce the array sizes and hence the memory usage.

In the context of the class of loops that compute multi-dimensional summations, we consider three optimization problems:

1. the minimization of arithmetic operations through the application of the algebraic laws of associativity, commutativity, and distributivity;

2. the minimization of memory usage by loop fusions, loop reordering, and changing the order of evaluation; and

3. the minimization of inter-processor communication and computational costs by appropriate partitioning of data and computations on processors.
Section 1.1 gives an example of the computational physics applications that motivate this research. The form of multi-dimensional summations is formalized in Section 1.2. An overview of our approach to solving the optimization problems is provided in Section 1.3 and illustrated by an example in Section 1.4. Section 1.5 provides a discussion of related work. The organization of the rest of this thesis can be found in Section 1.6.
1.1 Motivation

With the increase in power of supercomputers, more scientific computations become feasible with higher accuracies. In one class of scientific computations, the final result to be computed can be expressed as a multi-dimensional integral of the products of several arrays that represent some physical quantities. One computational physics application that motivates this research is the calculation of electronic properties of MX materials with the inclusion of many-body effects [6]. MX materials are linear chain compounds with alternating transition-metal atoms (M = Ni, Pd, or Pt) and halogen atoms (X = Cl, Br, or I). The following multi-dimensional integral computes the susceptibility in momentum space for the determination of the self-energy of MX compounds in real space.
χ_{G,G'}(k, iτ) = Σ_{R''L'', R'L'} D^{k,G}_{R''L'', R'L'} × (D^{k,G'}_{R''L'', R'L'})*

where

D^{k,G}_{R''L'', R'L'} = ∫ dr e^{i(k+G)·r} Σ_{R'''L'''} Φ_{R'L', R'''L'''}(r) × G_{R''L'', R'''L'''}(iτ)

and

Φ_{R'L', R'''L'''}(r) = Φ_{R'L'}(r - R') × Φ_{R'''L'''}(r - R''')

Variable   Interpretation                              Range
RL         Orbital                                     10
r          Discretized points in real space            10^…
τ          Time step                                   10
k          K-point in an irreducible Brillouin zone    10
G          Reciprocal lattice vector                   10^…

Table 1.1: Variables in an example physics computation.
In the above equations, Φ is the localized basis function, G is the orbital projected Green function for electrons and holes, and D is computed using a fast Fourier transform (FFT). The interpretation of the variables and their ranges are given in Table 1.1. After some simplifications and rewriting Φ as a two-dimensional array Y, the integral can be expressed as the following multi-dimensional summation of the product of several arrays. Note that array Y is sparse since Φ is a localized function.

Σ_{r, r1, RL, RL1, RL2, RL3} Y[r,RL] × Y[r,RL2] × Y[r1,RL3] × Y[r1,RL1] × G[RL1,RL,t] × G[RL2,RL3,t] × exp[k,r] × exp[G,r] × exp[k,r1] × exp[G1,r1]
The large sizes of the arrays could potentially lead to very high computational costs and memory usage. For example, computing the discretized integral using the above equations involves 3.54 × 10^15 arithmetic operations and would take more than six days on a 6.4 gigaflop machine. If unoptimized, the size of the intermediate array D would be 10^… elements. Moreover, moving these large arrays among the processors in a message-passing parallel machine could incur a large communication overhead. Therefore, optimization of this kind of computation on parallel computers is important.
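The quoted runtime follows from simple arithmetic; a quick sanity check is shown below. The operation count 3.54 × 10^15 is an assumption here (the exponent is illegible in this copy); it is the value consistent with the quoted 6.4 gigaflop rate and "more than six days" of runtime.

```python
# Back-of-the-envelope check of the quoted runtime for the unoptimized integral.
ops = 3.54e15            # total arithmetic operations (exponent assumed; see lead-in)
rate = 6.4e9             # sustained machine speed: 6.4 gigaflops
seconds = ops / rate     # wall-clock time at that rate
days = seconds / 86400.0 # 86,400 seconds per day
print(round(days, 1))    # about 6.4 days, i.e. "more than six days"
```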
Scientists who need to calculate such multi-dimensional integrals usually attempt to optimize the computation manually. They make decisions on implementation parameters, such as which intermediate results to calculate as stepping-stones to the final discretized integral; which loops to fuse to reduce array sizes; and how the arrays and the computation should be partitioned among the processors. However, the number of possible choices for the implementation parameters is very large and the manual optimizations may not yield an optimal way to compute the discretized integrals. Hence, there is a need to find the optimal solution algorithmically, which is the goal of this thesis.

We expect the results and algorithms presented in this thesis to be applicable not only to physics computations but also to multi-dimensional integral calculations in other scientific areas. We have recently learnt about computational chemistry calculations that share similar forms to the discretized integrals addressed here. Also, the algorithms for minimizing memory usage should be useful in other computer applications such as register allocation or database query optimization.
1.2 Form of Multi-Dimensional Summations

The multi-dimensional summations (i.e., discretized integrals) we consider in this thesis have the following general form.

A_0[I_0] = Σ_{i_1, i_2, ..., i_k} { A_1[I_1] × A_2[I_2] × ... × A_n[I_n] }

where I_j represents the indices of array A_j and is an ordered list of distinct indices from the set {i_1, i_2, ..., i_m}. Since A_0 is the result of the summation, its set of indices I_0 equals {i_{k+1}, ..., i_m}. We assume each index i_j has a constant range of 1 to N_j and the array entries are directly referenced by a list of distinct indices. Thus, for example, we do not consider a product term such as A_1[i_1, i_1] or A_2[i_1 - i_2].
A multi-dimensional summation in the above form can be implemented as a set of perfectly-nested loops of the following form, with a single reduction statement as its loop body.

for i_1 = 1 to N_1
  for i_2 = 1 to N_2
    ...
      for i_m = 1 to N_m
        A_0[I_0] = A_0[I_0] + A_1[I_1] × A_2[I_2] × ... × A_n[I_n]
      endfor
    ...
  endfor
endfor
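The general form above can be written out concretely. The sketch below (with illustrative array names and sizes, for n = 2 inputs, m = 3 indices, and k = 2 summation indices) implements the perfectly-nested loop directly and checks it against an equivalent library contraction:

```python
import numpy as np

# Instance of the general form: A0[i3] = sum over i1, i2 of A1[i1,i2] * A2[i2,i3],
# so the summation indices are {i1, i2} and the result indices are I0 = {i3}.
N1, N2, N3 = 4, 5, 6
rng = np.random.default_rng(0)
A1 = rng.random((N1, N2))
A2 = rng.random((N2, N3))

# Perfectly-nested loops with a single reduction statement as the loop body.
A0 = np.zeros(N3)
for i1 in range(N1):
    for i2 in range(N2):
        for i3 in range(N3):
            A0[i3] += A1[i1, i2] * A2[i2, i3]

# The same summation expressed as a single contraction.
assert np.allclose(A0, np.einsum('ij,jk->k', A1, A2))
```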
In practice, some of the arrays in the product terms occur more than once in the multi-dimensional summation (with each occurrence possibly associated with different index variables) and some of them are sparse. Moreover, some of the product terms are exponential functions and thus permit some products to be computed more efficiently using a fast Fourier transform rather than an explicit matrix-vector product. We also develop an optimization framework that appropriately models common sub-expressions, sparsity, and fast Fourier transform (FFT) operations.
1.3 Methodology

For the class of loops that compute multi-dimensional summations, we are interested in optimizing the following three performance metrics:

• the number of arithmetic operations, which can be reduced by applying algebraic laws to form and reuse intermediate results;

• memory usage, which is affected by loop fusions and the order of evaluation; and

• communication overhead, which depends on the distribution of data and computations among processors.
It would be desirable to consider the optimization of the three performance metrics in an integrated fashion. This integrated optimization problem can be stated as: given a multi-dimensional summation of the product of a list of arrays in the form specified in Section 1.2 and a limit on the amount of available memory, find a parallel program to compute the given summation in minimum time, using no more memory than specified. The form of a parallel program is a (possibly imperfectly-nested) loop structure containing array evaluation statements and array redistribution statements.
However, due to the large number of possible combinations of ways to apply the algebraic laws, to fuse the loops, and to partition the data and operations, the integrated optimization problem is extremely complex. In order to make the overall performance optimization problem more tractable, we view it in terms of three independent optimization problems:

1. Given a specification of the required computation as a multi-dimensional sum of the product of input arrays, determine an equivalent sequence of nested loops that computes the result using a minimum number of arithmetic operations.

2. Given an operation-count-optimal form of the computation (determined by solving the above sub-problem), apply loop transformations such as loop fusion, loop nest reordering, and loop permutation to reduce the memory usage to within the available amount of memory.

3. Given a sequence of loop computations to be performed on each processor (from solution of the two above sub-problems), determine the data distribution of input, intermediate and result arrays, and the mapping of computations among processors to minimize communication cost for load-balanced parallel execution.
Solving the three optimization problems above may not produce an optimal solution to the integrated optimization problem, but the solution produced should be close to optimal in practice.

In minimizing arithmetic operations by applying algebraic laws, we assume that the algebraic laws can be applied freely without jeopardizing numerical stability. If numerical stability becomes a concern in computing some multi-dimensional summations, we can disable the unsafe applications of algebraic laws to certain arrays that are sensitive to evaluation orders.
For each of the optimization problems, we do the following.

• We analyze and characterize the problem in terms of its solution space and the performance metrics.

• We design an efficient algorithm for finding an optimal solution, incorporating pruning rules and/or dynamic programming techniques to reduce the complexity of the algorithm.

• We extend the algorithm to handle sparse arrays, common sub-expressions, and fast Fourier transforms, which are characteristics of many practical scientific computations.

• We apply the algorithm on practical physics computations to show the effectiveness of the algorithm.
1.4 An Example

As an illustration of the three optimization problems described in the previous section, consider the following multi-dimensional summation, where the array elements are floating point numbers.

W[k] = Σ_{i,j,l} A[i,j] × B[j,k,l] × C[k,l]

A naive way to compute W[k] for all values of k is to have a single perfect loop nest such as:

initialize W
for i
  for j
    for k
      for l
        W[k] += A[i,j] × B[j,k,l] × C[k,l]
Computing W this way takes 3 × Ni × Nj × Nk × Nl floating point arithmetic operations. However, we can apply algebraic laws of commutativity, associativity, and distributivity to rearrange the multiplications and reductions in the multi-dimensional summation. Doing so may reduce the number of arithmetic operations since some index variables are missing in some arrays. The goal of the operation minimization problem is to find an operation-minimal rearrangement of the multiplications and reductions for a given multi-dimensional summation.
In our example, since C[k,l] does not depend on i and j, we can move it out of Σ_i and Σ_j and get the equation and the corresponding loop structure as shown in Figure 1.1(a). Here, f1, f2, and f3 are intermediate arrays. By computing the
W[k] = Σ_l ((Σ_i Σ_j A[i,j] × B[j,k,l]) × C[k,l])
     = Σ_j ((Σ_i A[i,j]) × (Σ_l B[j,k,l] × C[k,l]))
(a)
for i
  for j
    for k
      for l
        f1[i,j,k,l] = A[i,j] × B[j,k,l]
initialize f2
for i
  for j
    for k
      for l
        f2[k,l] += f1[i,j,k,l]
for k
  for l
    f3[k,l] = f2[k,l] × C[k,l]
initialize W
for k
  for l
    W[k] += f3[k,l]

(b)
initialize f1
for i
  for j
    f1[j] += A[i,j]
for j
  for k
    for l
      f2[j,k,l] = B[j,k,l] × C[k,l]
initialize f3
for j
  for k
    for l
      f3[j,k] += f2[j,k,l]
for j
  for k
    f4[j,k] = f1[j] × f3[j,k]
initialize W
for j
  for k
    W[k] += f4[j,k]
Figure 1.1: Two ways to compute the same multi-dimensional summation.
intermediate arrays, the number of multiplications invoking C is lowered by a factor of Ni × Nj. This reduces the number of arithmetic operations to 2 × Ni × Nj × Nk × Nl + 2 × Nk × Nl. But this is not the operation-minimal way to compute W. The optimal form is shown in Figure 1.1(b), which takes only 2 × Nj × Nk × Nl + Ni × Nj + 2 × Nj × Nk operations and represents an order of magnitude improvement.
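Both evaluation orders of Figure 1.1 can be verified to compute the same W. A small numerical sketch (the array sizes are illustrative), which also tabulates the operation counts quoted above:

```python
import numpy as np

Ni, Nj, Nk, Nl = 6, 7, 8, 9
rng = np.random.default_rng(1)
A = rng.random((Ni, Nj))
B = rng.random((Nj, Nk, Nl))
C = rng.random((Nk, Nl))

# Naive single loop nest: 3 * Ni*Nj*Nk*Nl operations.
W_naive = np.einsum('ij,jkl,kl->k', A, B, C, optimize=False)

# Operation-minimal form of Figure 1.1(b).
f1 = A.sum(axis=0)           # f1[j]     = sum_i A[i,j]          (Ni*Nj adds)
f2 = B * C[None, :, :]       # f2[j,k,l] = B[j,k,l] * C[k,l]     (Nj*Nk*Nl mults)
f3 = f2.sum(axis=2)          # f3[j,k]   = sum_l f2[j,k,l]       (Nj*Nk*Nl adds)
f4 = f1[:, None] * f3        # f4[j,k]   = f1[j] * f3[j,k]       (Nj*Nk mults)
W = f4.sum(axis=0)           # W[k]      = sum_j f4[j,k]         (Nj*Nk adds)
assert np.allclose(W, W_naive)

naive_ops = 3 * Ni * Nj * Nk * Nl
optimal_ops = 2 * Nj * Nk * Nl + Ni * Nj + 2 * Nj * Nk
print(naive_ops, optimal_ops)  # the optimized form needs far fewer operations
```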
In general, there are many ways to rearrange the multiplications and reductions in a given multi-dimensional summation and they result in different numbers of arithmetic operations. Since finding the operation-minimal rearrangement is not trivial, an automated procedure for determining the optimal solution is needed.
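For contraction orderings in particular, automated search now exists in general-purpose libraries. As an outside illustration only (not the algorithm developed in this thesis), NumPy's einsum_path searches for a cheap parenthesization of the same example:

```python
import numpy as np

Ni, Nj, Nk, Nl = 20, 30, 40, 50
A = np.ones((Ni, Nj))
B = np.ones((Nj, Nk, Nl))
C = np.ones((Nk, Nl))

# Ask NumPy to search for a low-cost contraction order for W[k].
path, report = np.einsum_path('ij,jkl,kl->k', A, B, C, optimize='optimal')
print(report)  # reports naive vs. optimized FLOP counts and the chosen order
```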
Once the operation-minimal rearrangement is determined, the next step is to implement it as some loop structure. A straightforward way is to generate a sequence of perfect loop nests, each evaluating an intermediate array, as in Figure 1.1. However, the intermediate arrays in practical scientific applications could be too large to fit into memory. The reduction of array sizes by reordering and fusing the loops is desirable.
Consider the operation-minimal form and the corresponding loop structure shown in Figure 1.1(b). For now, we assume the input arrays can be generated one element at a time (by the genv function for array v). Figure 1.2(a) shows the loop structure that includes the loop nests for generating the input arrays. Note that no loop fusion has occurred. Under static memory allocation, the total memory usage of the loop structure is the total size of the arrays, which is 2 × Nj × Nk × Nl + 2 × Nj × Nk + Ni × Nj + Nk × Nl + Nj + Nk.
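That total is just the sum of the individual array extents in Figure 1.2(a). A quick symbolic check (the index ranges are placeholders):

```python
# Static memory usage of the unfused loop structure: one full-size buffer per array.
Ni, Nj, Nk, Nl = 6, 7, 8, 9
sizes = {
    'A': Ni * Nj,   'B': Nj * Nk * Nl,  'C': Nk * Nl,
    'f1': Nj,       'f2': Nj * Nk * Nl, 'f3': Nj * Nk,
    'f4': Nj * Nk,  'f5': Nk,
}
total = sum(sizes.values())
# Agrees with 2*Nj*Nk*Nl + 2*Nj*Nk + Ni*Nj + Nk*Nl + Nj + Nk from the text.
assert total == 2*Nj*Nk*Nl + 2*Nj*Nk + Ni*Nj + Nk*Nl + Nj + Nk
print(total)
```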
Suppose we fuse all the loops around the evaluations of A and f1, all the loops
around those of B, C, f2, and f3, and also all the loops around those of f4 and
(a)
for i for j [ A[i,j]=genA(i,j)
for j for k for l [ B[j,k,l]=genB(j,k,l)
for k for l [ C[k,l]=genC(k,l)
initialize f1
for i for j [ f1[j]+=A[i,j]
for j for k for l [ f2[j,k,l]=B[j,k,l]xC[k,l]
initialize f3
for j for k for l [ f3[j,k]+=f2[j,k,l]
for j for k [ f4[j,k]=f1[j]xf3[j,k]
initialize f5
for j for k [ f5[k]+=f4[j,k]

(b)
initialize f1
for i for j [ A=genA(i,j)  f1[j]+=A
initialize f3
for k for l [ C=genC(k,l)
  for j [ B=genB(j,k,l)  f2=BxC  f3[j,k]+=f2
initialize f5
for j for k [ f4=f1[j]xf3[j,k]  f5[k]+=f4

(c)
for k for l [ C[k,l]=genC(k,l)
initialize f5
for j [ for i [ A[i]=genA(i,j)
  initialize f1
  for i [ f1+=A[i]
  for k [ initialize f3
    for l [ B=genB(j,k,l)  f2=BxC[k,l]  f3+=f2
    f4=f1xf3  f5[k]+=f4

Figure 1.2: Three loop fusion configurations.
f5 (after some suitable loop reordering). The resulting loop structure is shown in
Figure 1.2(b). Once a loop (say, a t-loop) around the evaluation of an array v and the
consumption of the same array is fused, the t-dimension of the array v is no longer
necessary and can be eliminated. This reduces the size of the array v by a factor of
Nt. If all the loops around the evaluation and the consumption of the same array are
fused, the array can be replaced by a scalar variable. Thus, A, B, C, f2, and f4 are
reduced to scalars.
Figure 1.2(c) shows another possible way to fuse the loops. Here, we first fuse all
the j-loops and then fuse all the k-loops and l-loops inside them. Doing so reduces
the sizes of all arrays except C and f5. By fusing the j-, k-, and l-loops around the
evaluation and the consumption of those arrays, the j-, k-, and l-dimensions of those
arrays are eliminated. Hence, B, f1, f2, f3, and f4 are reduced to scalars while A
becomes a one-dimensional array.
However, in many cases, we cannot fuse all the loops and reduce all arrays to
scalars. Instead, loop fusions could be mutually exclusive; fusing one loop may prevent
the fusion of another. For example, in Figure 1.2(b), with appropriate loop reordering,
we can further fuse either the j-loops around the evaluation and the consumption of
f1 or the k-loops around the evaluation and the consumption of f3, but not both.
Determining a set of loop fusions and loop reorderings that minimizes overall memory
usage is the objective of the memory usage minimization problem.
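The effect of fusion on intermediate storage can be illustrated with a minimal Python sketch (hypothetical sizes and generator formulas; only the l-loop fusion around f2 from Figure 1.2(b) is modeled):

```python
Nj, Nk, Nl = 3, 4, 5
# Deterministic stand-ins for genB and genC, chosen only for illustration.
B = [[[(j + 2*k + 3*l) * 0.5 for l in range(Nl)] for k in range(Nk)] for j in range(Nj)]
C = [[(k + l) * 0.25 for l in range(Nl)] for k in range(Nk)]

# Unfused: the full intermediate f2[j][k][l] (Nj*Nk*Nl elements) is materialized.
f2 = [[[B[j][k][l] * C[k][l] for l in range(Nl)] for k in range(Nk)] for j in range(Nj)]
f3_unfused = [[sum(f2[j][k]) for k in range(Nk)] for j in range(Nj)]

# Fused: the l-loop around the evaluation and the consumption of f2 is fused,
# so the l-dimension of f2 is eliminated and f2 collapses to a scalar.
f3_fused = [[0.0] * Nk for _ in range(Nj)]
for j in range(Nj):
    for k in range(Nk):
        acc = 0.0                      # f2 reduced to a scalar accumulator
        for l in range(Nl):
            acc += B[j][k][l] * C[k][l]
        f3_fused[j][k] = acc

assert f3_fused == f3_unfused
```

Both versions add the same values in the same order, so the results are identical; only the Nj x Nk x Nl intermediate array disappears.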
In implementing a nested loop structure on a message-passing parallel computer,
the partitioning of the arrays and computation among the processors has to be
decided. Assume that no loop fusion has taken place, so that arrays are in their full
sizes. We take a logical multi-dimensional view of the processors so that each array can be
distributed or replicated along one or more of the processor dimensions. Suppose
the processors are viewed as a two-dimensional array. In the running example, one
of many possible choices is to distribute the j-dimension of array B (referenced by
B[j,k,l]) along the first processor dimension and the k-dimension along the second
processor dimension. Similarly, we can distribute the k-dimension of array C (referenced
by C[k,l]) along the first processor dimension and the l-dimension along the
second processor dimension. Before evaluating f2, the operand arrays B and C need to
be aligned so that operand elements of B and C are on the same processor as the local
portion of f2 to be formed. One way to align B and C is to redistribute C so that its
k-dimension is distributed along the second processor dimension and its distributed
portions are replicated along the first processor dimension. Then, multiplications can
take place on the processors and the product array f2 will have the same distribution
as B. Alternatively, we can redistribute B so that it has the same distribution as C,
and the product array will have the same distribution as C. The communication
cost incurred in redistributing C under the first alternative may be different from the
cost of redistributing B under the second alternative. Moreover, depending on how
f2 is used in evaluating f3, the resulting distribution of f2 with one alternative may
be better than with the other.
Given a nested loop structure, there are a large number of ways to partition
the arrays and computation among the processors. They could vary significantly in
terms of communication and computational cost. The goal of the communication
minimization problem is to find a partitioning for the arrays and computation that
minimizes the total communication and computational cost.
1.5 Related Work
Reduction of arithmetic operations has traditionally been done by compilers using
the technique of common sub-expression elimination [17]. Potkonjak et al. [50]
considered the multiple constant multiplication problem, in which a large number
of multiplications of one variable with several constants need to be performed. The
number of shifts is first minimized, and then the number of additions is minimized
using an iterative pairwise matching heuristic for common sub-expression elimination.
Their problem formulation can be applied to other high-level tasks such as linear
transforms and single and multiple polynomial evaluations. Here, we are minimizing
the total number of multiplications and additions.
Lu and Chew [43] presented a fast algorithm to solve for the scattered field of
a two-dimensional, dielectric-coated conducting cylinder using a hybrid of the combined
field surface integral equation and the volume integral equation. The algorithm relies on
the translation of scattering centers to speed up the matrix-vector multiplication in
the conjugate gradient method. An efficient approach is used to reduce the floating
point operations from O(N^2) to O(N^1.5).
Winograd [65] addressed the general problem of evaluating multiple expressions
that share common variables using the minimum number of arithmetic operations.
This problem was motivated by Strassen's algorithm for matrix multiplication.
However, we do not consider Strassen's algorithm as an array expression evaluation
method that can be synthesized automatically.
Miller [48] suggested several analytical and numerical techniques for reducing the
operation count in computational electromagnetic applications. These techniques
include reducing the spatial-sample count, filling the impedance matrix using model-based
parameter estimation, using fast Fourier transforms in iterative solutions, applying
near-neighbor approximations, reducing the number of near-neighbor coefficients, and
so on. These and similar techniques are useful in reducing the complexity in modeling
physical systems. Some of them have already been used in forming the multi-dimensional
summation that models electronic structures in Section 1.1.
Loop transformations that improve locality and parallelism have been studied
extensively in recent years. For instance, Wolf and Lam [67] formulated cache reuse
and locality mathematically and presented a loop transformation theory that unifies
various transforms as unimodular matrix transformations. Based on their formulation,
they proposed an algorithm for improving the locality of a loop nest by loop
interchange, reversal, skewing, and tiling. Kennedy and McKinley [28] explored the
tradeoff between data locality and parallelism. They presented a memory model to
determine cache line reuse from both multiple accesses to the same memory location
and from consecutive memory accesses. However, we are unaware of any work on loop
transformation based on the distributive law as a means of minimizing arithmetic
operations.
Much work has been done on improving locality and parallelism by loop fusion.
Kennedy and McKinley [29] presented a new algorithm for fusing a collection of loops
to minimize parallel loop synchronization and maximize parallelism. They proved
that finding loop fusions that maximize locality is NP-hard. Two polynomial-time
algorithms for improving locality were given.
Singhai and McKinley [58] examined the effects of loop fusion on data locality
and parallelism together. They viewed the optimization problem as a problem of
partitioning a weighted directed acyclic graph, in which the nodes represent loops
and the weights on edges represent the amount of locality and parallelism. Although the
problem is NP-hard, they were able to find optimal solutions in restricted cases and
heuristic solutions for the general case.
However, the work in this dissertation considers a different use of loop fusion,
which is to reduce the array sizes and memory usage of automatically synthesized code
containing nested loop structures. Traditional compiler research does not address
this use of loop fusion because this problem does not arise with manually-produced
programs.
Gao et al. [18] studied the contraction of arrays into scalars through loop fusion
as a means to reduce the overhead of array accesses. They partitioned a collection
of loop nests into fusible clusters using a max-flow min-cut algorithm, taking into
account the data dependence constraints. However, their study is motivated by data
locality enhancement and not memory reduction. Also, they only considered fusions
of conformable loop nests, that is, loop nests that contain exactly the same set of
loops.
Loop fusion in the context of delayed evaluation of array expressions in compiling
APL programs has been discussed by Guibas and Wyatt in [22]. They presented an
algorithm for incorporating the selector operators into the accessors for the leaf nodes
of a given expression tree. As part of the algorithm, a general buffering mechanism is
devised to save portions of a sub-expression that will be repeatedly needed, to avoid
future recomputation. They considered loop fusion without any loop reordering, and
their approach is also not aimed at minimizing array sizes. We are unaware of any work
on fusion and reordering of multi-dimensional loop nests into imperfectly-nested loops
as a means to reduce memory usage.
A simpler problem related to the memory usage minimization problem is the
register allocation problem, in which the sizes of all nodes are unity and loop fusion is
not considered. The goal is to find an evaluation order of a given binary expression
tree that minimizes the number of registers needed. It has been addressed in [49, 57]
and can be solved in O(n) time, where n is the number of nodes in the expression
tree. For each parent node in the expression tree, the strategy is to evaluate the
child subtree that uses more registers before evaluating the other child subtree. But
if the expression tree is replaced by a directed acyclic graph (in which all nodes are
still of unit size), the problem becomes NP-complete [56]. The algorithm in [57] for
expression trees of unit-sized nodes does not extend directly to expression trees having
nodes of different sizes. Appel and Supowit [5] generalized the register allocation
problem to higher-degree expression trees of arbitrarily-sized nodes. However, the
problem they addressed is slightly different from ours in that, in their problem, space
for a node is not allocated during its evaluation. Also, they restricted their attention
to solutions that evaluate subtrees contiguously, which is sub-optimal in some cases.
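The evaluate-larger-child-first strategy for unit-sized expression trees is the classical Sethi-Ullman numbering; a minimal sketch follows (the Node class and the example trees are illustrative, not from the cited works):

```python
class Node:
    """Binary expression tree node; a node with no children is a leaf."""
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def registers_needed(node):
    # A leaf needs one register to hold its value.
    if node.left is None and node.right is None:
        return 1
    l = registers_needed(node.left)
    r = registers_needed(node.right)
    # Unequal demands: evaluate the costlier child first; its result occupies
    # one register while the cheaper child uses strictly fewer than max(l, r).
    # Equal demands: one extra register is unavoidable.
    return max(l, r) if l != r else l + 1

# A balanced tree over four leaves needs 3 registers;
# a left-deep chain over the same four leaves needs only 2.
balanced = Node(Node(Node(), Node()), Node(Node(), Node()))
chain = Node(Node(Node(Node(), Node()), Node()), Node())
assert registers_needed(balanced) == 3
assert registers_needed(chain) == 2
```

This is the unit-size special case; as the text notes, the labeling does not extend directly to nodes of different sizes.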
Several researchers have investigated issues pertaining to automatic mapping of
data and computation onto parallel machines. Chatterjee et al. [10, 11] considered
the optimal alignment of arrays in evaluating array expressions using data-parallel
languages such as Fortran 90 on massively parallel machines. Alignment functions
are decomposed into axis, stride, and offset components. They presented dynamic
programming algorithms that solve the optimal alignment problem for several
communication cost metrics: multi-dimensional grids and rings, hypercubes, fat-trees,
and the discrete metric. However, they do not consider distribution and replication of
arrays.
Anderson and Lam [4] developed global loop transformations under a linear
algebraic framework that handles very general loop forms. They described a compiler
algorithm that automatically finds computation and data decompositions that
optimize both parallelism and locality. The algorithm models both distributed and
shared address space machines. But it only handles dense matrices where the array
subscripts are affine functions of the loop indices.
Gupta and Banerjee [23] have developed a constraint-based heuristic strategy
for automatic data distribution in the PARADIGM compiler. The compiler makes
data partitioning decisions for Fortran 77 procedures. They decomposed the data
partitioning problem into a number of sub-problems, each dealing with a different
distribution parameter for all the arrays, and presented algorithms that determine
those parameters.
The communication minimization problem we consider differs from these related
works in that we address a more restricted form of the data/computation mapping
problem, but evaluate many more data/computation mapping possibilities, including
data/computation replication and/or distribution on a subset of processors.
1.6 Organization of this Dissertation
The rest of this dissertation deals with the three optimization problems, namely
operation minimization, communication minimization, and memory usage minimization.
Each chapter addresses an optimization problem by first considering the case
where all arrays are dense and the computation can be modeled as an expression tree
with sum and multiply operators. The solutions are then extended to include sparse
arrays, use of fast Fourier transforms, and the utilization of common sub-expressions
(which results in directed acyclic graphs instead of expression trees).
Chapter 2 describes the operation minimization problem and shows by an example
how the implemented pruning search algorithm finds solutions that are better than the
best manually-optimized solutions. The operation minimization algorithm has been
implemented and has been used to obtain significant improvements in the number of
operations for self-energy electronic structure calculations in a tight-binding scheme.
In Chapter 3, we study the use of loop fusion to reduce intermediate array sizes and
present algorithms for finding the optimal loop fusion configuration that minimizes
memory usage. Both static and dynamic memory allocation models are considered.
Chapter 4 considers the optimal partitioning of data and loops to minimize
communication and computational costs for execution on parallel machines. Two
approaches to minimizing communication and computational cost on parallel computers
under a memory constraint are also described.
Chapter 5 provides conclusions and describes some research topics that may be
pursued in the context of optimizing multi-dimensional summation calculations.
CHAPTER 2
OPERATION MINIMIZATION
In the class of computations considered, the final result to be computed can be
expressed as multi-dimensional summations of the product of many input arrays. Due
to commutativity, associativity, and distributivity, there are many different ways to
obtain the same final result, and they could differ widely in the number of arithmetic
operations required. The problem of finding an equivalent form that computes the
result with the least number of operations is not trivial, and so a software tool for
doing this is desirable.
This chapter is organized as follows. Section 2.1 explains this operation
minimization problem with an example. Section 2.2 formalizes the operation
minimization problem in terms of a sequence of formulae that compute the same result as
the original multi-dimensional summation. Section 2.3 describes an expression tree
representation of a formula sequence. Section 2.4 proves that the problem of operation
minimization is NP-complete. A pruning search procedure for finding a formula
sequence that minimizes the number of operations is developed in Section 2.5. The
extensions of the search procedure to handle common sub-expressions, sparse arrays,
and FFT are described in Sections 2.6, 2.7 and 2.8, respectively. An example of its
application is given in Section 2.9.
2.1 Problem Description
Consider, for example, the following multi-dimensional summation
S[t] = Σ_i Σ_j Σ_k A[i,j,t] x B[j,k,t]

A direct way to implement the summation would be:

initialize S
for i for j for k for t [ S[t]+=A[i,j,t]xB[j,k,t]
Execution of this program fragment involves Ni x Nj x Nk x Nt floating point
multiplications and an equal number of floating point additions, resulting in a total
of 2 x Ni x Nj x Nk x Nt operations. If the above loop were input to an optimizing
compiler, it would perform dependence analysis [68] on the loop and determine that
the innermost t-loop was an independent loop and that the other three loops involved
dependences due to reduction operations. Although the loop could be parallelized, no
attempt would be made by the compiler to reduce the number of arithmetic operations
involved. As shown below, a considerable saving in the number of operations is in fact
possible for this computation through application of algebraic properties of addition
and multiplication.
Since addition and multiplication can both be considered associative and
commutative (although floating-point operations are not strictly associative,
vectorizing/parallelizing compilers generally treat these operations as acceptably associative),
and multiplication distributes over addition, we have:
1. Σ_i Σ_j X = Σ_j Σ_i X

2. If term X does not depend on i, then Σ_i (X x Y) = X x Σ_i Y
The first rule allows us to reorder the positions of any number of consecutive
summations, while the second rule permits the extraction of an expression independent of
a summation index out of that summation. By application of the algebraic properties,
we can rewrite the summation as:

S[t] = Σ_j ((Σ_i A[i,j,t]) x (Σ_k B[j,k,t]))

which can be transformed into the following program fragment:

initialize Temp1
for i for j for t [ Temp1[j,t]+=A[i,j,t]
initialize Temp2
for j for k for t [ Temp2[j,t]+=B[j,k,t]
initialize S
for j for t [ S[t]+=Temp1[j,t]xTemp2[j,t]
The new program fragment requires only Nj x Nt floating point multiplications
and Ni x Nj x Nt + Nj x Nk x Nt + Nj x Nt floating point additions. The total number
of floating point operations, which is Ni x Nj x Nt + Nj x Nk x Nt + 2 x Nj x Nt,
is an order of magnitude less than that of the original program fragment. Although
Temp1 and Temp2 are two-dimensional arrays, they can be reduced to scalar variables
by doing loop fusions. The optimization of memory usage via loop transformations
will be addressed in Chapter 3.
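The saving can be verified numerically; the sketch below (hypothetical small sizes, random inputs) evaluates both program fragments in Python and checks that they agree:

```python
import random

Ni, Nj, Nk, Nt = 3, 4, 5, 6
A = [[[random.random() for _ in range(Nt)] for _ in range(Nj)] for _ in range(Ni)]
B = [[[random.random() for _ in range(Nt)] for _ in range(Nk)] for _ in range(Nj)]

# Original fragment: 2 * Ni*Nj*Nk*Nt operations.
S_direct = [sum(A[i][j][t] * B[j][k][t]
                for i in range(Ni) for j in range(Nj) for k in range(Nk))
            for t in range(Nt)]

# Rewritten fragment: factor the sums over i and k, then one multiply per (j, t).
Temp1 = [[sum(A[i][j][t] for i in range(Ni)) for t in range(Nt)] for j in range(Nj)]
Temp2 = [[sum(B[j][k][t] for k in range(Nk)) for t in range(Nt)] for j in range(Nj)]
S_opt = [sum(Temp1[j][t] * Temp2[j][t] for j in range(Nj)) for t in range(Nt)]

assert all(abs(a - b) < 1e-9 for a, b in zip(S_direct, S_opt))
```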
The above example is simple enough to be able to manually seek the optimal
index reordering and application of the distributive law to minimize the number of
operations. However, the complex sequences of such summations that arise in some
computational physics applications are not easily hand-optimized. Thus, a software
tool for doing this is desirable.
2.2 Formalization of the Optimization Problem
Generalizing from the example of the previous section, the problem addressed is
the minimization of the arithmetic operations of a given multi-dimensional summation.
We are interested in deriving an automatable strategy for operation reordering and
application of the distributive law to reduce the amount of arithmetic required.
However, automatic generation of transformations such as those needed to transform the
standard matrix multiplication algorithm into Strassen's algorithm is clearly beyond
the scope of what we believe is feasible.
Hence, we first have to define more precisely the space of equivalent programs that
are to be searched amongst. We formalize this space as a set of formula sequences.
Each formula in a formula sequence is either:

• a multiplication formula of the form: fr[...] = X[...] x Y[...], or

• a summation formula of the form: fr[...] = Σ_i X[...]

where X (and Y) is either a product term in the given multi-dimensional summation
or a previously defined function fs.

Let X.dimens denote the set of indices in X[...]. For a formula to be well-formed,
every index in X and Y, except the summation index in the second form,
must appear inside fr[...]. Thus, fr.dimens = X.dimens ∪ Y.dimens for any
multiplication formula, and fr.dimens = X.dimens − {i} for any summation formula
with summation index i. Each formula in a sequence computes a partial result of the
summation, and the last formula produces the final result desired. Such a formula
sequence fully specifies the multiplications and additions to be performed in
computing the result, and it is straightforward to generate loop code corresponding to a
particular formula sequence.
For example, the summation S[t] = Σ_i Σ_j Σ_k (A[i,j,t] x B[j,k,t]) can be
represented by the formula sequence below:

f1[i,j,k,t] = A[i,j,t] x B[j,k,t]
f2[i,j,t] = Σ_k f1[i,j,k,t]
f3[j,t] = Σ_i f2[i,j,t]
S[t] = Σ_j f3[j,t]

whereas the optimized form S[t] = Σ_j ((Σ_i A[i,j,t]) x (Σ_k B[j,k,t])) corresponds to
the sequence:

f1[j,t] = Σ_i A[i,j,t]
f2[j,t] = Σ_k B[j,k,t]
f3[j,t] = f1[j,t] x f2[j,t]
S[t] = Σ_j f3[j,t]
The cost, or total number of floating point operations, of a formula sequence is the
sum of the costs of the individual formulae in the sequence, which can be obtained
as below:

• For a multiplication formula of the form fr[...] = X[...] x Y[...], the cost
is Π_{h ∈ X.dimens ∪ Y.dimens} N_h. For instance, the formula f1[i,j,k,t] = A[i,j,t] x
B[j,k,t] has a cost of Ni x Nj x Nk x Nt.

• For a summation formula of the form fr[...] = Σ_i X[...], the cost is (Ni −
1) x Π_{h ∈ X.dimens − {i}} N_h. The term Ni − 1 comes in because adding Ni numbers
requires Ni − 1 additions. But for the sake of simplicity, we may approximate
this cost as Π_{h ∈ X.dimens} N_h. For example, the cost of the formula f2[i,j,t] = Σ_k f1[i,j,k,t]
is (Nk − 1) x Ni x Nj x Nt, or simply Ni x Nj x Nk x Nt.
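The two cost rules can be written directly as functions over index sets; a minimal sketch (the function names and the sizes dictionary are illustrative):

```python
def mult_cost(x_dimens, y_dimens, sizes):
    """Cost of fr[...] = X[...] * Y[...]: product of N_h over X.dimens | Y.dimens."""
    cost = 1
    for h in x_dimens | y_dimens:
        cost *= sizes[h]
    return cost

def sum_cost(x_dimens, i, sizes, approximate=True):
    """Cost of fr[...] = sum_i X[...]; exact form uses (N_i - 1) additions."""
    cost = 1
    for h in x_dimens - {i}:
        cost *= sizes[h]
    return cost * (sizes[i] if approximate else sizes[i] - 1)

sizes = {'i': 10, 'j': 10, 'k': 20, 't': 30}
# f1[i,j,k,t] = A[i,j,t] * B[j,k,t] costs Ni*Nj*Nk*Nt.
assert mult_cost({'i', 'j', 't'}, {'j', 'k', 't'}, sizes) == 10 * 10 * 20 * 30
# f2[i,j,t] = sum_k f1[i,j,k,t] costs (Nk - 1)*Ni*Nj*Nt exactly.
assert sum_cost({'i', 'j', 'k', 't'}, 'k', sizes, approximate=False) == 19 * 10 * 10 * 30
```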
A well-formed formula sequence computes a given multi-dimensional summation
if it satisfies the following conditions:

• Each product term in the given summation appears exactly once among the
sequence of formulae.

• Each summation index in the given summation is summed over in a single
formula.

• No summation index appears in any formula subsequent to the summation over
that index.

• Each defined function fr[...], except the last one, appears exactly once among
the formulae subsequent to its definition.

Hence, a valid sequence must contain exactly n − 1 multiplication formulae and k
summation formulae, where n is the number of product terms in the given summation
and k is the number of summation indices in the given summation. With the
above formalization, the operation minimization problem can be restated as finding
[Tree diagram: root S = Σ_j; its child f3 = x; the children of f3 are f1 = Σ_i over leaf A[i,j,t] and f2 = Σ_k over leaf B[j,k,t].]

Figure 2.1: An example expression tree.
a formula sequence that computes a given multi-dimensional summation and incurs
the least cost, i.e., requires a minimum number of floating point operations.

2.3 Expression Tree Representation

A formula sequence can be represented graphically as an expression tree to show
the hierarchical structure of the computation more clearly. As an example, the
expression tree that represents the above optimized formula sequence is shown in Figure 2.1.
In an expression tree, the leaves are the product terms in the summation and the
internal nodes are the fr[...] defined by the formulae, with the last formula at the
root. An internal node may either be a multiplication node or a summation node.
A multiplication node corresponds to a multiplication formula and has two children,
which are the terms being multiplied together. A summation node corresponds to a
summation formula and has only one child, representing the term on which summation
is performed.
2.4 NP-Completeness

In this section we prove that the decision version of the operation minimization
problem formalized in the previous section is NP-complete. To show this, we identify
a simpler sub-problem: the multiplication sub-problem. Proving the decision version
of this sub-problem to be NP-complete will prove the decision version of the operation
minimization problem to be NP-complete.

The multiplication sub-problem is one where no summation indices are present at
all. Given a set of array variables, they are to be multiplied together in some order so
as to minimize the total multiplication cost. Note that multiplication in this context is
commutative and the variables can be rearranged and grouped in any order. Thus the
polynomial-time dynamic programming algorithm for the matrix-chain multiplication
problem does not generalize to our problem.
To illustrate the multiplication sub-problem, let us consider the example

A[i] x B[j] x C[i,k] x D[j,k]

where Ni = Nj = 10 and Nk = 20. One way to perform the multiplications is

((A[i] x B[j]) x C[i,k]) x D[j,k]

which requires Ni x Nj + 2 x Ni x Nj x Nk = 4100 arithmetic operations. However,
this is not optimal; the optimal order of multiplication is

(A[i] x C[i,k]) x (B[j] x D[j,k])

which requires only Ni x Nk + Nj x Nk + Ni x Nj x Nk = 2400 arithmetic operations.
Note that the cost of each node in the expression tree representing the order of
multiplication is equal to the product of the sizes of the indices of the array represented
at that node. The multiplication cost of the root in the expression tree is fixed and
independent of the order of multiplication.
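For small instances, the multiplication sub-problem can be solved by brute force over all binary trees; the sketch below recovers the optimal cost of 2400 for the example above (representing each term as a frozenset of index names is an illustrative choice, not the dissertation's implementation):

```python
from functools import reduce
from itertools import combinations

sizes = {'i': 10, 'j': 10, 'k': 20}

def node_cost(index_set):
    # Cost of an internal node: product of the sizes of its indices.
    return reduce(lambda a, b: a * b, (sizes[x] for x in index_set), 1)

def best_cost(terms):
    """Minimum total multiplication cost over all binary trees for `terms`."""
    if len(terms) == 1:
        return 0
    best = None
    for pair in combinations(range(len(terms)), 2):
        a, b = terms[pair[0]], terms[pair[1]]
        rest = [t for n, t in enumerate(terms) if n not in pair]
        c = node_cost(a | b) + best_cost(rest + [a | b])
        best = c if best is None else min(best, c)
    return best

# A[i], B[j], C[i,k], D[j,k] as index sets.
terms = [frozenset('i'), frozenset('j'), frozenset({'i', 'k'}), frozenset({'j', 'k'})]
assert best_cost(terms) == 2400   # (A[i] x C[i,k]) x (B[j] x D[j,k])
```

The search re-examines equivalent groupings many times; it is meant only to confirm the example, not to be efficient.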
The multiplication sub-problem can be formally stated as follows. Given a finite
set I, a size Ni ∈ Z+ for each i ∈ I, and a family F of subsets of I such that
I = ∪_{S ∈ F} S, find a binary tree T whose leaf nodes are the sets in F and that minimizes
Σ_{v ∈ T − F} c(v), where T − F is the set of internal nodes (including the root node) in
T, c(v) = Π_{i ∈ I(v)} Ni is the cost of node v, I(v) = ∪_{S ∈ D(v)} S, and D(v) is the set
of leaf nodes in the subtree rooted at v. The corresponding multiplication decision
problem asks whether there is a solution T that costs no more than a given integer
K. The decision problem is in NP because an expression tree can be generated non-
deterministically, and it can be checked in polynomial time whether it costs no more
than K.
The operation minimization problem can be stated as follows. Given a multi-
dimensional summation in the form described in Section 1.2, find an expression tree
that specifies the evaluation order of the multiplications and summations and involves
the minimum number of operations. The corresponding decision problem asks whether there is
an expression tree that evaluates a given multi-dimensional summation and costs no
more than a given integer K.
We prove the NP-completeness of the multiplication decision problem in two steps.
First, we reduce a known NP-complete problem, the Subset Product problem, to
a new problem that we call the Product Partition problem. Next, we reduce the
Product Partition problem to the multiplication decision problem. This proves the NP-
completeness of the multiplication decision problem as well as the NP-completeness
of the decision version of the operation minimization problem.
The Subset Product problem can be defined as follows. Given a finite set A, a
size s(a) ∈ Z+ for each a ∈ A, and a positive integer y, determine whether there
exists a subset A' ⊆ A such that Π_{a ∈ A'} s(a) = y. This problem is known to be NP-
complete [20]. The Product Partition problem is similar. Given a finite set B and
a size s'(b) ∈ Z+ for each b ∈ B, the Product Partition problem asks whether there
exists a subset B' ⊆ B such that Π_{b ∈ B'} s'(b) = Π_{b ∈ B − B'} s'(b).

Let x = Π_{a ∈ A} s(a), where (A, s, y) is an instance of the Subset Product problem.
Note that if 2x/y is not an integer, then there is no solution to this instance. Otherwise,
reduce the Subset Product problem to the Product Partition problem by adding
two new elements of sizes 2x/y and 2y to the set A. Formally, for each Subset Product
problem instance (A, s, y), form a Product Partition problem instance (B, s') where
B = A ∪ {b', b''}, s'(a) = s(a) for all a ∈ A, b' ∉ A, b'' ∉ A, s'(b') = 2x/y, s'(b'') = 2y,
and x = Π_{a ∈ A} s(a). If the Subset Product problem instance (A, s, y) has a solution
A', then B' = A' ∪ {b'} is a solution to the Product Partition problem instance (B, s')
since Π_{b ∈ B'} s'(b) = Π_{b ∈ B − B'} s'(b) = 2x. Conversely, if the Product Partition problem
instance (B, s') has a solution B', then exactly one of b' and b'' must
belong to B' because s'(b') x s'(b'') = 4x is greater than Π_{b ∈ B'} s'(b) = Π_{b ∈ B − B'} s'(b) = 2x. In
this way, either A' = B − B' − {b'} or A' = B' − {b'} would be a solution to the Subset
Product problem instance (A, s, y). Since this reduction can be performed in polynomial time,
it follows that the Product Partition problem is NP-complete.
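The reduction itself is only a few lines; a sketch (the function name and the plain-list representation of sizes are illustrative):

```python
from math import prod

def subset_product_to_product_partition(A_sizes, y):
    """Map a Subset Product instance (sizes s(a), target y) to a Product
    Partition instance by appending the two new elements 2x/y and 2y."""
    x = prod(A_sizes)
    if (2 * x) % y != 0:
        return None                    # the original instance has no solution
    return A_sizes + [2 * x // y, 2 * y]

# Instance: sizes {2, 3, 6}, target y = 6 (A' = {6} or A' = {2, 3} works).
inst = subset_product_to_product_partition([2, 3, 6], 6)
x = 2 * 3 * 6
assert inst == [2, 3, 6, 12, 12]
# B' = {6, b'} has product 6 * (2x/y) = 72 = 2x, matching its complement {2, 3, b''}.
assert 6 * inst[3] == 2 * x == prod(inst) // (2 * x)
```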
[Tree diagrams, root cost L²x in each case: (a) the two second-level nodes each have cost L√x; (b) the two second-level nodes have unequal costs Ly and Lx/y; (c) Mb' and Mb'' are not split between the two second-level subtrees.]

Figure 2.2: Different expression trees for a multiplication decision problem instance.
Given a Product Partition problem instance (B, s'), we construct a multiplication
decision problem instance as follows. Let x = Π_{b ∈ B} s'(b) and note that if √x is
not an integer, then the instance has no solution. Otherwise, reduce the Product
Partition problem to the multiplication decision problem as follows. First, for each
b ∈ B, form a one-dimensional array Mb[ib] where N_ib = s'(b). Then, add two more
arrays Mb'[ib'] and Mb''[ib''] where N_ib' = N_ib'' = L = (n + 1)√x, n = |B|, and
x = Π_{b ∈ B} s'(b). This reduction can be done in polynomial time. Arrays Mb' and Mb''
are so large that their relative position in the expression tree becomes very significant.
These n + 2 arrays, namely Mb', Mb'' and Mb for each b ∈ B, and a maximum cost
K = L²x + 2L√x + (n − 2)√x define the multiplication decision problem instance.
We shall prove that the Product Partition problem instance (B, s') has a solution if
and only if the multiplication decision problem instance has a solution.

We use two facts about the multiplication decision problem. First, the cost of the
root node is the same in all solutions and is the product of the sizes of all indices,
in this case L²x. Second, the cost at any node is bounded by the cost of any of its
ancestors. Thus the cost of any solution is bounded by (n + 1)L²x.
If the Product Partition problem instance {B, s') has a solution, then we can
construct an expression tree T for the m ultiplication decision problem instance so
that two non-sibling third-level nodes have equal cost \ / x and the two second-level
nodes at the second level have equal cost L y / x , as shown Figure 2.2(a). We establish
an upper bound on the cost o f T as follows. Since the n + 2 arrays require n 4- 1
m ultiplications, there are n - f l —3 = n —2 m ultiplication nodes below the second level
in the expression tree. Thus, T has a cost of at most K = L'^x -h 2 L y / x + {n — 2) y /x
and hence is a solution to the m ultiplication decision problem instance.
Conversely, if the multiplication decision problem instance has a solution T (that costs no more than K = L²x + 2L√x + (n − 2)√x), we argue that the two second-level nodes of T (i.e., the two children of the root node of T) must have equal costs. The reason is that if a solution to the multiplication decision problem instance has unequal costs on the two second-level nodes, say Ly on one node and Lx/y on the other, where y ≠ √x, then the expression tree would have a cost of at least L²x + Ly + Lx/y. This is greater than K = L²x + 2L√x + (n − 2)√x, since y + x/y > 2√x + 1 and L > (n − 2)√x (see Figure 2.2(b)). Moreover, M_y and M_y′ must be in the two separate subtrees rooted at the two second-level nodes of T. If M_y and M_y′ are not split at the second level, the expression tree would have a cost of at least L²x + L² + x, which is greater than K = L²x + 2L√x + (n − 2)√x (see Figure 2.2(c)). Removing M_y and M_y′ from the leaf nodes in these two subtrees gives us a solution to the Product Partition problem instance.
It follows that the multiplication decision problem is NP-complete. Since every instance of the multiplication sub-problem is an instance of the operation minimization problem, the decision version of the operation minimization problem is NP-complete.
2.5 A Pruning Search Procedure
Since the operation minimization problem has been shown to be NP-complete, it is impractical to seek a polynomial-time algorithm for it. We have to resort either to heuristics or to exponential-time searches for the optima. For the kind of multi-dimensional summations that arise in practice, the number of nested loops and the number of array variables are typically less than ten. Thus, a well-pruned search procedure should be practically feasible. We pursue such an approach here.
The following procedure can be used to exhaustively enumerate all valid formula sequences:

1. Form a pool of the product terms in the given multi-dimensional summation. Let X_a denote the a-th product term and X_a.dimens the set of index variables in X_a[...]. Set r to zero.

2. Increment r. Then, perform either action:

(a) Write a formula f_r[...] = X_a[...] × X_b[...] where X_a[...] and X_b[...] are any two terms in the pool. The indices for f_r are f_r.dimens = X_a.dimens ∪ X_b.dimens. Replace X_a[...] and X_b[...] in the pool by f_r[...].

(b) If there exists a summation index (say i) that appears in exactly one term (say X_a[...]) in the pool, create a formula f_r[...] = Σ_i X_a[...] where f_r.dimens = X_a.dimens − {i}. Replace X_a[...] in the pool by f_r[...].

3. When step 2 cannot be performed any more, a valid formula sequence is obtained. To obtain all valid sequences, exhaust all alternatives in step 2 using depth-first search.
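The procedure above can be sketched in code as follows. This is an illustration, not the author's implementation; the index ranges and all names are assumptions. Each term is represented by its set of index variables, a multiplication costs the product of the ranges of the result's indices, and summing over i costs (N_i − 1) times the size of the result, matching the dense operation counts used in this chapter.

```python
from itertools import combinations

# Example index ranges (the Ni of the loop indices); illustrative values.
N = {'i': 10, 'j': 10, 'k': 10, 'l': 10}

def prod(indices):
    """Product of the ranges of a set of index variables."""
    result = 1
    for i in indices:
        result *= N[i]
    return result

def search(pool, sum_indices, cost, best):
    """Depth-first enumeration of all valid formula sequences.
    pool: tuple of frozensets, one per term (its set of index variables);
    sum_indices: summation indices not yet summed over."""
    if len(pool) == 1 and not sum_indices:
        return min(best, cost)
    # action (b): sum over an index that appears in exactly one term
    for i in sum_indices:
        holders = [p for p, t in enumerate(pool) if i in t]
        if len(holders) == 1:
            p = holders[0]
            t = pool[p]
            step = (N[i] - 1) * prod(t - {i})      # additions for this sum
            new = pool[:p] + pool[p + 1:] + (t - {i},)
            best = search(new, sum_indices - {i}, cost + step, best)
    # action (a): multiply any two terms in the pool
    for a, b in combinations(range(len(pool)), 2):
        f = pool[a] | pool[b]
        new = tuple(t for p, t in enumerate(pool) if p not in (a, b)) + (f,)
        best = search(new, sum_indices, cost + prod(f), best)
    return best

# Illustrative input: sum over j and l of A[i,j] * B[j,k,l] * C[k,l]
pool = (frozenset('ij'), frozenset('jkl'), frozenset('kl'))
print(search(pool, frozenset('jl'), 0, float('inf')))
```

For this small input the cheapest sequence multiplies B and C first, sums over l, multiplies in A, and finally sums over j.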
Figure 2.3: Expression tree transformations for the first pruning rule.
The enumeration procedure above is inefficient in that a particular formula sequence may be generated more than once in the search process. This can be avoided by creating an ordering among the product terms and the intermediate generated functions (which can be treated as new terms, numbered in increasing order as they are generated).

A further reduction in the cost of the search procedure can be achieved by pruning the search space with the following two rules:

1. When a summation index appears in only one term, perform the summation over that index immediately, without considering any other possibilities at that step.

2. If two or more terms have exactly the same set of indices, first multiply them together before considering any other possibilities.
It can be proved that the use of the pruning rules will not change the cost of the optimal formula sequence found. To prove the first pruning rule, suppose Y is the only term in the pool that contains a summation index i and we choose to multiply Y with another term instead of summing it over i. A formula sequence obtained in
this way can be represented by an expression tree T in which i does not appear in any node outside the subtree rooted at Y and the Σ_i node is an ancestor but not the parent of Y. We show that T can be transformed into a similar expression tree T′ in which the Σ_i node is the parent of Y, without increasing the cost. Figure 2.3(a) illustrates this transformation, which is done by moving the Σ_i node towards Y one step at a time until it becomes the parent of Y. If the child of the Σ_i node is another summation node (say Σ_j), we can interchange the positions of the two summation nodes, as shown in Figure 2.3(b), with no effect on the cost, because both Σ_i(Σ_j Z) and Σ_j(Σ_i Z) have the same cost of (N_i × N_j − 1) × Π_{k ∈ Z.dimens − {i,j}} N_k. If the child of the Σ_i node is a multiplication node with two children Z1 and Z2, where Z1 is Y or an ancestor of Y, then we can move the Σ_i node to become the parent of Z1 and a child of the multiplication node (see Figure 2.3(c)). This move affects the costs of the Σ_i node and the multiplication node only. The cost of the summation node cannot increase because Z1.dimens ∪ Z2.dimens ⊇ Z1.dimens. The cost of the multiplication node is reduced by a factor of N_i, since multiplying Z1 and Z2 involves one more index, namely i, than multiplying Σ_i Z1 and Z2.
To prove the second pruning rule, suppose Y1 and Y2 are two terms in the pool that have the same set of indices and we choose to multiply Y1 by another term Z where Z.dimens ≠ Y1.dimens. A formula sequence thus obtained can be represented by an expression tree T in which Y1 and Y2 are the roots of two separate subtrees and Y1 and Z are children of a multiplication node, as shown in Figure 2.4(a). We can transform T into a similar expression tree T′ by moving the subtree rooted at Y1 to become a sibling of Y2, as shown in Figure 2.4(b). This transformation only affects the cost of the nodes along the paths from Y1 and Y2 to their common ancestor. Let
Figure 2.4: The expression tree transformation for the second pruning rule.
those nodes be labeled as shown in the figure, where 0 < a < s. We show that the transformation does not increase the total operation count by comparing the costs of corresponding nodes in T and T′ as follows.

• f′_r cannot cost more than f_r because f_r.indices ⊇ Y1.dimens and f′_r.indices = Y1.dimens.

• f_r.indices ⊇ Z.dimens implies that for all 0 < k ≤ a, f_{r+k}.indices ⊇ f′_{r+k}.indices, and thus f′_{r+k} costs less than or equal to f_{r+k}.

• f_r.indices = Y1.dimens = Y2.dimens implies that for all a + 1 ≤ k ≤ s − 1, f_{r+k}.indices = f′_{r+k}.indices, and hence f′_{r+k} and f_{r+k} have the same cost.

• Since f_{r+a}.indices ⊇ f′_{r+a}.indices and f_{r+s−1}.indices = f′_{r+s−1}.indices, we have f_{r+s}.indices ⊇ f′_{r+s}.indices, which implies f′_{r+s} cannot cost more than f_{r+s}.
Based on the above two rules, together with the ordering of the product terms that guarantees each formula sequence is generated only once, we obtain the following pruning search procedure:
1. Form a list of the product terms in the given multi-dimensional summation. Let X_a denote the a-th product term and X_a.dimens the set of index variables in X_a[...]. Set r and c to zero. Set d to the number of product terms.

2. While there exists a summation index (say i) that appears in exactly one term (say X_a[...]) in the list and a > c, increment r and d and create a formula f_r[...] = Σ_i X_a[...] where f_r.dimens = X_a.dimens − {i}. Remove X_a[...] from the list. Append to the list X_d[...] = f_r[...]. Set c to a.

3. Increment r and d and form a formula f_r[...] = X_a[...] × X_b[...] where X_a[...] and X_b[...] are two terms in the list such that a < b and b > c, giving priority to terms that have exactly the same set of indices. The indices for f_r are f_r.dimens = X_a.dimens ∪ X_b.dimens. Remove X_a[...] and X_b[...] from the list. Append to the list X_d[...] = f_r[...]. Set c to b. Go to step 2.

4. When steps 2 and 3 cannot be performed any more, a valid formula sequence is obtained. To obtain all valid sequences, exhaust all alternatives in step 3 using depth-first search. The search space can be further pruned using the branch-and-bound technique by giving up a search path as soon as its partial cost exceeds the lowest cost among all complete sequences found so far.
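The pruned search can be sketched as below. Again this is an illustration, not the thesis code: rule 1 is applied greedily without backtracking, rule 2 restricts the candidate multiplication pairs, and a branch-and-bound cutoff abandons any path whose partial cost already reaches the best complete cost found so far.

```python
import itertools

N = {'i': 10, 'j': 10, 'k': 10, 'l': 10}   # example index ranges

def prod(indices):
    out = 1
    for i in indices:
        out *= N[i]
    return out

def best_cost(pool, sums, cost=0, best=float('inf')):
    if cost >= best:                       # branch-and-bound cutoff
        return best
    # rule 1: an index held by exactly one term is summed immediately,
    # without considering any other possibility at this step
    for i in sorted(sums):
        holders = [p for p, t in enumerate(pool) if i in t]
        if len(holders) == 1:
            p = holders[0]
            step = (N[i] - 1) * prod(pool[p] - {i})
            new = pool[:p] + pool[p + 1:] + [pool[p] - {i}]
            return best_cost(new, sums - {i}, cost + step, best)
    if len(pool) == 1:
        return min(best, cost)
    # rule 2: terms with identical index sets are multiplied first
    pairs = list(itertools.combinations(range(len(pool)), 2))
    same = [(a, b) for a, b in pairs if pool[a] == pool[b]]
    for a, b in (same if same else pairs):
        f = pool[a] | pool[b]
        new = [t for p, t in enumerate(pool) if p not in (a, b)] + [f]
        best = best_cost(new, sums, cost + prod(f), best)
    return best

pool = [frozenset('ij'), frozenset('jkl'), frozenset('kl')]
print(best_cost(pool, frozenset('jl')))
```

On the same small input as before, the pruned search returns the same minimum cost as the exhaustive enumeration while exploring far fewer sequences.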
2.6 Common Sub-Expressions
In some computational physics applications, the same input array may appear in more than one product term, and each occurrence of a repeated input array may be associated with different index variables. This leads to common sub-expressions in
the multi-dimensional summation. Formulae involving repeated input arrays could actually be computing the same intermediate result. Consider the following function:

R[i,k,m] = Σ_j Σ_l (A[i,j] × B[j,k] × A[k,l] × B[l,m])

Assume N_i = N_j = N_k = N_l = N_m. One formula sequence that computes R[i,k,m] is:

f1[i,j,k] = A[i,j] × B[j,k]
f2[i,k] = Σ_j f1[i,j,k]
f3[k,l,m] = A[k,l] × B[l,m]
f4[k,m] = Σ_l f3[k,l,m]
R[i,k,m] = f2[i,k] × f4[k,m]

Here, f1 and f3 are computing the same intermediate result since f3[k,l,m] = f1[k,l,m]. We call f1 and f3 equivalent formulae or common sub-expressions, and their costs should be counted only once. In other words, f3 can be obtained without any additional cost. Notice that f2 and f4 are also equivalent to each other, as f4[k,m] = f2[k,m]. Thus, f4 has a zero cost.

To detect common sub-expressions, we compare each newly-formed formula against each existing formula and see if the two formulae are of the same type (i.e., both multiplications or both summations), the operand arrays are equivalent, and the index variables in the operand arrays (and the summation index for summation formulae) can be mapped one-to-one from one formula to the other. In the above example, f1 and f3 are equivalent because they are both products of arrays A and B and the indices k, l, and m in f3 can be mapped one-to-one to the indices i, j, and
k in f1, respectively. Also, f4 is equivalent to f2 because they are both summations, their operands f3 and f1 are equivalent, and the indices k and m can be mapped one-to-one to i and k.
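The equivalence check described above can be sketched as follows. This is a simplified illustration, not the thesis implementation: the formula representation and all names are assumptions, and the recursive equivalence of operands is abstracted into equal operand names.

```python
def _try_map(pairs):
    """Try to build a bijection between index variables that maps each
    f-index tuple onto the corresponding g-index tuple."""
    fwd, bwd = {}, {}
    for fi, gi in pairs:
        if len(fi) != len(gi):
            return False
        for a, b in zip(fi, gi):
            if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
                return False
    return True

def equivalent(f, g):
    """f, g are ('mul', [(name, idxs), (name, idxs)]) or
    ('sum', index, (name, idxs)).  Equal names stand for operand arrays
    already known to be equivalent (e.g. the same input array)."""
    if f[0] != g[0]:
        return False
    if f[0] == 'sum':
        _, i, (fn, fi) = f
        _, j, (gn, gi) = g
        return fn == gn and _try_map([((i,), (j,)), (fi, gi)])
    _, [(n1, i1), (n2, i2)] = f
    _, [(m1, j1), (m2, j2)] = g
    # a product's operands may match in either order
    return ((n1 == m1 and n2 == m2 and _try_map([(i1, j1), (i2, j2)])) or
            (n1 == m2 and n2 == m1 and _try_map([(i1, j2), (i2, j1)])))

# f1[i,j,k] = A[i,j] x B[j,k]  vs  f3[k,l,m] = A[k,l] x B[l,m]
f1 = ('mul', [('A', ('i', 'j')), ('B', ('j', 'k'))])
f3 = ('mul', [('A', ('k', 'l')), ('B', ('l', 'm'))])
print(equivalent(f1, f3))  # True: i, j, k map one-to-one to k, l, m
```

The bijection found here (i to k, j to l, k to m) is the inverse of the mapping described in the text, which maps f3's indices onto f1's.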
2.7 Sparse Arrays
When some input arrays are sparse, the intermediate and final results could be sparse, and the number of operations required for each formula is lower than if the arrays were dense. For estimating the operation count, there is a need to determine the sparsity of the intermediate and final results and the reduced cost of each formula.

We make two assumptions about the sparsity of the input arrays. First, we assume that sparsity only exists between two of the array dimensions, i.e., whether an element is zero depends only on the indices of exactly two dimensions. Second, we assume sparsity is uniform, i.e., the non-zero elements are uniformly distributed over the range of values of an array dimension if it involves sparsity.

However, the above assumptions are not enough to determine the sparsity of the result arrays; it depends further on the structure of the sparse arrays. In the computational physics applications we consider, some array dimensions correspond to points in three-dimensional space, and sparsity arises from the fact that certain physical quantities diminish with distance and are treated as zero beyond a cut-off limit. For such sparse arrays we develop an analytical model for the sparsity of the result arrays in terms of the sparsity of the operand arrays of a summation or multiplication formula, based on the following observation. Finding the resulting sparsity is equivalent to finding the probability that a set of randomly distributed points in 3D space satisfies some given constraints on their pairwise distances.
We represent the sparsity of each array as a list of sparsity entries. Under the above assumptions, an input array can have only a single sparsity entry, but intermediate arrays may have multiple sparsity entries. Each sparsity entry contains the two dimensions of the array involved in the sparsity and a sparsity factor. The sparsity factor is defined as the fraction of non-zero elements between the two dimensions (which is always between 0 and 1) and is proportional to the cube of the cut-off limit between the two points in 3D space corresponding to the two dimensions. The overall sparsity of an intermediate array (i.e., the fraction of non-zero elements in the entire array) is the product of all sparsity factors in its sparsity entries. The sparsity of an array can also be viewed as a graph in which each array dimension involving sparsity is a vertex and each sparsity entry is an edge that connects the two vertices for the two array dimensions in that entry. We call this graph a sparsity graph. For convenience, when an array reference is given, we use index variables to refer to array dimensions and to label the vertices in a sparsity graph. As an example, Figure 2.5(a) shows the sparsity entry and the sparsity graph of an array referenced by A[i,j,k] in which sparsity exists between the last two dimensions and 2% of its elements are non-zero.
For a multiplication formula, we combine the sparsity entries of the two operand arrays to obtain the sparsity entries of the result array and examine the resulting sparsity graph. If no cycle is formed in the graph, no further work is needed, and the overall sparsity of the result array is the product of those of the operand arrays, since the dimensions involving sparsity represent independent points in 3D space. A cycle of size 2 is formed if both operand arrays have sparsity between dimensions referenced by the same pair of index variables. The two sparsity entries forming the
Figure 2.5: Sparsity entries and sparsity graphs of arrays.
cycle can be coalesced into one by keeping the smaller of the two sparsity factors. This is because, for both operand elements to be non-zero, the distance between the pair of points that the two index variables represent in 3D space must be less than the smaller of the two cut-off limits corresponding to the two sparsity factors. Figures 2.5(c) and (d) show the resulting sparsity entries and sparsity graphs for the two cases above. Determining the overall sparsity in the presence of cycles of size 3 or larger requires solving multi-dimensional summations and is not considered in this thesis.
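The multiplication rule can be sketched as follows. This is an illustrative reading of the text, not the thesis implementation; a union-find pass over the combined sparsity graph flags the unhandled cycles of size 3 or more.

```python
def multiply_sparsity(entries_a, entries_b):
    """Combine the sparsity entries (dim1, dim2, factor) of the two operands
    of a multiplication.  A size-2 cycle (both operands sparse over the same
    pair of dimensions) is coalesced by keeping the smaller factor; cycles
    of size 3 or more are left unhandled, as in the text."""
    combined = {}
    for d1, d2, s in list(entries_a) + list(entries_b):
        key = tuple(sorted((d1, d2)))
        combined[key] = min(s, combined.get(key, 1.0))
    # after coalescing, any remaining cycle in the sparsity graph has size >= 3
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for d1, d2 in combined:
        r1, r2 = find(d1), find(d2)
        if r1 == r2:
            raise NotImplementedError("sparsity-graph cycle of size >= 3")
        parent[r1] = r2
    entries = [(d1, d2, s) for (d1, d2), s in combined.items()]
    overall = 1.0
    for _, _, s in entries:
        overall *= s
    return entries, overall

# No cycle: an entry (j,k,0.02) combined with (j,l,0.05) gives an overall
# sparsity of 0.02 * 0.05 = 0.001, as in the acyclic case of Figure 2.5
print(multiply_sparsity([('j', 'k', 0.02)], [('j', 'l', 0.05)])[1])
```

A 2-cycle input such as (j,k,0.02) against (j,k,0.05) collapses to the single entry (j,k,0.02), keeping the smaller factor.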
For a summation formula, the overall sparsity of the result array is the probability that, for a set of randomly distributed points in 3D space, there exists a point (corresponding to the summation index) such that some given distance constraints are satisfied. We first copy the sparsity entries from the operand array to the result array, paying special attention to the entries involving the summation index. If only
one sparsity entry has the summation index, we remove it from the result array. If the summation index (say i) appears in exactly two sparsity entries, say (i, j, s1) and (i, k, s2), we replace the two entries with a single entry (j, k, (s1^(1/3) + s2^(1/3))³), because the cut-off limit of the distance between the two points represented by j and k equals the sum of the two given cut-off limits, and the sparsity factor is proportional to the cube of the cut-off limit. Figure 2.5(e) shows an example of the second case. When the summation index appears in more than two sparsity entries, we need to solve some multi-dimensional summations.
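The summation rule can be sketched likewise. This is illustrative code; the merged factor follows from factors being proportional to the cube of the cut-off limits, and with the factors 0.02 and 0.05 it reproduces the 0.262 overall sparsity shown in Figure 2.5(e).

```python
def sum_sparsity(entries, i):
    """Sparsity entries of the result of summing over index i.  Entries are
    (dim1, dim2, factor); a factor is proportional to the cube of a cut-off
    radius, so two entries sharing i merge with their cut-offs added."""
    with_i = [e for e in entries if i in e[:2]]
    rest = [e for e in entries if i not in e[:2]]
    if len(with_i) <= 1:
        return rest                        # a lone entry involving i is dropped
    if len(with_i) == 2:
        (a1, b1, s1), (a2, b2, s2) = with_i
        j = a1 if b1 == i else b1          # the partner dimension of each entry
        k = a2 if b2 == i else b2
        s = (s1 ** (1 / 3) + s2 ** (1 / 3)) ** 3
        return rest + [(j, k, min(s, 1.0))]   # a sparsity factor is at most 1
    raise NotImplementedError("summation index in more than two entries")

# (i,j,0.02) and (i,k,0.05) merge into a single entry (j,k,~0.262)
print(sum_sparsity([('i', 'j', 0.02), ('i', 'k', 0.05)], 'i'))
```

Entries not involving the summation index pass through unchanged, matching the copy-then-adjust description in the text.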
Once the sparsity of a result array is known, finding the cost of the formula that computes the result array is straightforward. In a multiplication formula, the number of operations is the same as the number of non-zero elements in the result array. In a summation formula, the number of operations equals the number of non-zero elements in the operand array minus the number of non-zero elements in the result array (since adding N numbers requires only N − 1 additions). These operation counts are exact if mechanisms exist to ensure operations are performed only on non-zero operands; otherwise, the operation counts represent lower bounds on the actual number of operations.
2.8 Fast Fourier Transform
In many of the multi-dimensional summations, some of the product terms are exponential functions of some of the indices. Since the exponential function is unique, we consider all product terms that are exponential functions identical. We also assume the products of exponential functions can be obtained at zero cost, since such products are themselves exponential functions.
The existence of exponential functions in the product terms permits some formulae to be computed as fast Fourier transforms (FFTs), which may involve significantly fewer operations than if the same result were computed as a multiplication followed by a summation. The cost of an FFT formula equals the number of individual FFTs to be performed times the cost of each individual FFT. The general form of an FFT formula is

f_r[K, j] = Σ_i X[K, i] × exp[i, j]

where K is a set of indices, j ∉ K, and exp[i, j] denotes the exponential term. If X is dense, the number of operations in computing f_r is given by

(Π_{k ∈ K} N_k) × C × N log₂ N

where N = max(N_i, N_j), C is a constant that depends on the FFT algorithm in use, and C × N log₂ N is the cost of each individual FFT.
Sparsity in the operand array could lower both components of the FFT cost. Sparsity factors involving the summation index i reduce the size of the individual FFTs, and those not involving i reduce the number of FFTs. Whether X is sparse or not, the number of operations can be expressed as

(size(f_r)/N_j) × C × N log₂ N

where N = max(N_j × size(X)/size(f_r), N_j), size(A) denotes the number of non-zero elements in array A, and N_j × size(X)/size(f_r) is the number of non-zero elements from X that participate in each individual FFT. For example, consider the formula f1[j,k,l] = Σ_i X[i,k,l] × exp[i,j] in which X has sparsity entries (i,k,0.16) and (k,l,0.2). The resulting sparsity of f1 is given by (k,l,0.2). If N_i = 800, N_j = N_k = N_l = 100, and C = 10, then size(X) = 800 × 100 × 100 × 0.16 × 0.2 = 2.56 × 10⁵, size(f1) = 100³ × 0.2 = 2 × 10⁵, N = max(100 × 2.56 × 10⁵ / (2 × 10⁵), 100) = 128, and the number of operations is (2 × 10⁵/100) × 10 × 128 log₂ 128 = 1.792 × 10⁷.
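The sparse FFT cost expression can be checked directly against this worked example with a small helper (an illustration, not the thesis code):

```python
from math import log2

def fft_formula_cost(Nj, C, size_X, size_fr):
    """Operation count of f_r[K,j] = FFT_i of X[K,i] * exp[i,j], following
    the expression in the text: (number of FFTs) x (cost per FFT)."""
    n_ffts = size_fr / Nj                      # individual FFTs to perform
    n = max(Nj * size_X / size_fr, Nj)         # points per individual FFT
    return n_ffts * C * n * log2(n)

# the worked example: X has sparsity entries (i,k,0.16) and (k,l,0.2),
# N_i = 800, N_j = N_k = N_l = 100, and C = 10
size_X = 800 * 100 * 100 * 0.16 * 0.2          # 2.56e5 non-zero elements
size_f1 = 100 ** 3 * 0.2                       # 2e5 non-zero elements
print(fft_formula_cost(100, 10, size_X, size_f1))  # about 1.792e7
```

With a dense X, size(f_r)/N_j reduces to the product of the N_k for k in K, so the helper degenerates to the dense cost expression given earlier.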
Since the result array and its sparsity are the same whether it is computed using an FFT or not, this choice does not interact with subsequent formulae. Thus, we can compare the FFT cost with the combined cost of multiplication and summation and pick the lower one.
2.9 An Example
The algorithm described above for finding a formula sequence with the minimum number of operations has been implemented and tested. We have applied the program to several computational physics applications that involve very complex integrals. The optimal solutions generated by the program are usually a factor of two better than the best manually-optimized solutions. One example application, which determines the self-energy in the electronic structure of solids, is specified by the following input file to our program:

sum r,r1,RL,RL1,RL2,RL3
Y[r,RL] Y[r,RL2] Y[r1,RL3] Y[r1,RL1] G[RL1,RL,t] G[RL2,RL3,t] exp[k,r] exp[G,r] exp[k,r1] exp[G1,r1]
RL,RL1,RL2,RL3 1000
t 100
k 10
G,G1 1000
r,r1 100000
sparse
Y 1,2,0.1
end
The first 2 lines show the summation to be computed. The next 5 lines are the ranges of the index variables. The last 3 lines specify the sparsity in the input arrays. Note that the first 4 product terms involve an identical input array (as do the next
2 product terms) and the input array Y is sparse. The constant C in FFT formulae is set to 10. The best hand-obtained formula sequence has a cost of 3.54 × 10¹⁵ operations. Our program enumerated 369 formula sequences and found 42 sequences with the same minimum cost of 1.89 × 10¹⁵ operations. One minimum-cost sequence (which involves 2 FFT formulae) generated by the program is shown below:
f1[r,RL,RL1,t] = Y[r,RL] * G[RL1,RL,t]             cost= 1e+12        <r,RL,0.1>
f2[r,RL1,t] = sum RL f1[r,RL,RL1,t]                cost= 9.9e+11      dense
f5[r,RL2,r1,t] = Y[r,RL2] * f2[r1,RL2,t]           cost= 1e+14        <r,RL2,0.1>
f6[r,r1,t] = sum RL2 f5[r,RL2,r1,t]                cost= 9.9e+13      dense
f7[k,r,r1] = exp[k,r] * exp[k,r1]                  cost= 0            dense
f10[r,r1,t] = f6[r,r1,t] * f6[r1,r,t]              cost= 1e+12        dense
f11[k,r,r1,t] = f7[k,r,r1] * f10[r,r1,t]           cost= 1e+13        dense
f13[k,r1,t,G] = fft r f11[k,r,r1,t] * exp[G,r]     cost= 1.660964e+15 dense
f15[k,t,G,G1] = fft r1 f13[k,r1,t,G] * exp[G1,r1]  cost= 1.660964e+13 dense
CHAPTER 3

MEMORY USAGE MINIMIZATION
Given an optimal sequence of multiplication and summation formulae, the simplest way to implement it is to compute the formulae one by one, each coded as a set of perfectly nested loops, and to store the intermediate results produced by each formula in an intermediate array. However, in practice, the input and intermediate arrays could be so large that they cannot fit into the available memory. Hence, there is a need to fuse the loops as a means of reducing memory usage. By fusing loops between the producer loop and the consumer loop of an intermediate array, intermediate results are formed and used in a pipelined fashion and reuse the same reduced array space. There are many different ways to fuse the loops, which could result in different memory usage. The problem of finding a loop fusion configuration that minimizes memory usage without increasing the operation count is non-trivial.
In this chapter, we develop an optimization framework that appropriately models the relation between loop fusion and memory usage. We present algorithms that find the optimal loop fusion configuration minimizing memory usage, under both static and dynamic memory allocation models.
Section 3.1 introduces the memory usage minimization problem. Section 3.2 describes some preliminary concepts that we use to formulate our solutions to the problem. Section 3.3 presents an algorithm for finding a memory-optimal loop fusion configuration under static memory allocation. Section 3.4 investigates the sub-problem of determining a memory-optimal evaluation of expression trees involving large objects with no or equal fusions. Based on the solution to this sub-problem, we develop in Section 3.5 an algorithm to solve the memory usage minimization problem under the dynamic memory allocation model. Sections 3.6, 3.7, and 3.8 discuss how common sub-expressions, sparse arrays, and fast Fourier transforms, respectively, affect the memory usage minimization problem. Section 3.9 explores ways to further reduce memory usage at the cost of increasing the number of arithmetic operations. An example of the application of the memory usage minimization algorithm to a computational physics problem is given in Section 3.10.
3.1 Introduction
Consider the multi-dimensional summation shown in Figure 3.1(a), computed by the formula sequence in Figure 3.1(b), which is represented by the expression tree in Figure 3.1(c). A naive way to implement the computation is to have a set of perfectly-nested loops for each node in the tree, as shown in Figure 3.2(a), where the nesting indicates the scopes of the loops. Figure 3.2(b) shows how the sizes of the arrays may be reduced by the use of loop fusions. It shows the resulting loop structure after fusing all the loops between A and f1, all the loops among B, C, f2, and f3, and all the loops between f4 and f5. Here, A, B, C, f2, and f4 are reduced to scalars. After fusing all the loops between a node and its parent, all dimensions of the child array
f5[k] = Σ_i Σ_j Σ_l (A[i,j] × B[j,k,l] × C[k,l])

(a) A multi-dimensional summation

f1[j] = Σ_i A[i,j]
f2[j,k,l] = B[j,k,l] × C[k,l]
f3[j,k] = Σ_l f2[j,k,l]
f4[j,k] = f1[j] × f3[j,k]
f5[k] = Σ_j f4[j,k]

(b) A formula sequence computing (a)    (c) An expression tree for (b)

Figure 3.1: An example multi-dimensional summation and two representations of a computation.
are no longer needed and can be eliminated. The elements in the reduced arrays are now reused to hold different values at different iterations of the fused loops. Each of those values was held by a different array element before the loops were fused (as in Figure 3.2(a)). Note that some loop nests (such as those for B and C) are reordered and some loops within loop nests (such as the j-, k-, and l-loops for B, f2, and f3) are permuted in order to facilitate loop fusions.

For now, we assume the leaf node arrays (i.e., input arrays) can be generated one element at a time (by the genv function for array v) so that loop fusions with their parents are allowed. This assumption holds for arrays in which the value of each element is a function of the array subscripts, as is the case for many arrays in the physics computations that we work on. As will become clear later, the case where an input array has to be read in or produced in slices or in its entirety can be handled by disabling the fusion of some or all of the loops between the leaf node and its parent.
(a) one perfectly-nested loop nest per node:

for i
  for j
    A[i,j] = genA(i,j)
for j
  for k
    for l
      B[j,k,l] = genB(j,k,l)
for k
  for l
    C[k,l] = genC(k,l)
initialize f1
for i
  for j
    f1[j] += A[i,j]
for j
  for k
    for l
      f2[j,k,l] = B[j,k,l] × C[k,l]
initialize f3
for j
  for k
    for l
      f3[j,k] += f2[j,k,l]
for j
  for k
    f4[j,k] = f1[j] × f3[j,k]
initialize f5
for j
  for k
    f5[k] += f4[j,k]

(b) fusing A with f1; B, C, f2, and f3; and f4 with f5:

initialize f1
for i
  for j
    A = genA(i,j)
    f1[j] += A
initialize f3
for k
  for l
    C = genC(k,l)
    for j
      B = genB(j,k,l)
      f2 = B × C
      f3[j,k] += f2
initialize f5
for j
  for k
    f4 = f1[j] × f3[j,k]
    f5[k] += f4

(c) fusing all the j-loops and the k- and l-loops inside them:

for k
  for l
    C[k,l] = genC(k,l)
initialize f5
for j
  for i
    A[i] = genA(i,j)
  initialize f1
  for i
    f1 += A[i]
  for k
    initialize f3
    for l
      B = genB(j,k,l)
      f2 = B × C[k,l]
      f3 += f2
    f4 = f1 × f3
    f5[k] += f4

(d)-(f) show the corresponding fusion graphs (solid lines: fusion edges; dotted lines: potential fusion edges).
Figure 3.2: Three loop fusion configurations for the expression tree in Figure 3.1.
Figure 3.2(c) shows another possible loop fusion configuration, obtained by fusing all the j-loops and then all the k-loops and l-loops inside them. The sizes of all arrays except C and f5 are smaller. By fusing the j-, k-, and l-loops between those nodes, the j-, k-, and l-dimensions of the corresponding arrays can be eliminated. Hence, B, f1, f2, f3, and f4 are reduced to scalars, while A becomes a one-dimensional array.

In general, fusing a t-loop between a node v and its parent eliminates the t-dimension of the array v and reduces the array size by a factor of N_t. In other words, the size of an array after loop fusions equals the product of the ranges of the loops that are not fused with its parent. We only consider fusions of loops among nodes that are all transitively related by (i.e., form a transitive closure over) parent-child relations. Fusing loops between unrelated nodes (such as fusing siblings without fusing their parent) has no effect on array sizes. We also restrict our attention for now to loop fusion configurations that do not increase the operation count. The tradeoff between memory usage and arithmetic operations is considered in Section 3.9.
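This size rule is a one-line computation, sketched here with assumed index ranges for illustration:

```python
def fused_array_size(N, dims, fused_with_parent):
    """Size of array v after loop fusion: the product of the ranges of the
    dimensions whose loops are NOT fused with v's parent."""
    size = 1
    for d in dims:
        if d not in fused_with_parent:
            size *= N[d]
    return size

N = {'j': 10, 'k': 20, 'l': 30}              # example index ranges
print(fused_array_size(N, ('j', 'k', 'l'), set()))            # 6000
print(fused_array_size(N, ('j', 'k', 'l'), {'k', 'l'}))       # 10
print(fused_array_size(N, ('j', 'k', 'l'), {'j', 'k', 'l'}))  # 1 (scalar)
```

Fusing all loops with the parent reduces the array to a scalar, as happens to B, f1, f2, f3, and f4 in Figure 3.2(c).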
In the class of loops considered in this thesis, the only dependence relations are those between children and parents, and array subscripts are simply loop index variables¹. So, loop permutations, loop nest reordering, and loop fusions are always legal as long as child nodes are evaluated before their parents. This freedom allows the loops to be permuted, reordered, and fused in a large number of ways that differ in memory usage. Finding a loop fusion configuration that uses the least memory is not trivial. We believe this problem is NP-complete but have not found a proof yet.

¹When array subscripts are not simple loop variables, as many researchers have studied, more dependence relations exist, which prevent some loop rearrangements and/or loop fusions. In that case, a restricted set of loop fusion configurations would need to be searched in minimizing memory usage.
Fusion graphs. Let T be an expression tree. For any given node v ∈ T, let subtree(v) be the set of nodes in the subtree rooted at v, v.parent be the parent of v, and v.indices be the set of loop indices for v (including the summation index v.sumindex if v is a summation node). A loop fusion configuration can be represented by a graph called a fusion graph, which is constructed from T as follows.

1. Replace each node v in T by a set of vertices, one for each index i ∈ v.indices.

2. Remove all tree edges in T for clarity.

3. For each loop fused (say, of index i) between a node and its parent, connect the i-vertices in the two nodes with a fusion edge.

4. For each pair of vertices with matching index between a node and its parent, if they are not already connected with a fusion edge, connect them with a potential fusion edge.
Figure 3.2 shows the fusion graphs alongside the loop fusion configurations. In the figure, solid lines are fusion edges and dotted lines are potential fusion edges, which are fusion opportunities not exploited. As an example, consider the loop fusion configuration in Figure 3.2(b) and the corresponding fusion graph in Figure 3.2(e). Since the j-, k-, and l-loops are fused between f2 and f3, there are three fusion edges, one for each of the three loops, between f2 and f3 in the fusion graph. Also, since no loops are fused between f3 and f4, the edges between f3 and f4 in the fusion graph remain potential fusion edges.

In a fusion graph, we call each connected component of fusion edges (i.e., a maximal set of connected fusion edges) a fusion chain, which corresponds to a fused loop in the
loop structure. The scope of a fusion chain c, denoted scope(c), is defined as the set of nodes it spans. In Figure 3.2(f), there are three fusion chains, one for each of the j-, k-, and l-loops; the scope of the shortest fusion chain is {B, f2, f3}. The scopes of any two fusion chains in a fusion graph must either be disjoint or a subset/superset of each other. Scopes of fusion chains do not partially overlap because loops do not (i.e., loops must be either separate or nested). Therefore, any fusion graph with fusion chains whose scopes partially overlap is illegal and does not correspond to any loop fusion configuration.
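The legality condition is a pairwise check over chain scopes, sketched below. This is an illustration; the chain node sets are assumptions read off configuration (c) of Figure 3.2.

```python
def legal_fusion_graph(chain_scopes):
    """A fusion graph is legal iff the scopes of every two fusion chains
    are disjoint or one is a subset of the other (loops are either
    separate or nested, never partially overlapping)."""
    for a in chain_scopes:
        for b in chain_scopes:
            if a is not b and (a & b) and not (a <= b or b <= a):
                return False
    return True

# chains as in Figure 3.2(f) (node sets assumed from configuration (c)):
j_chain = {'A', 'f1', 'B', 'f2', 'f3', 'f4', 'f5'}
k_chain = {'B', 'f2', 'f3', 'f4', 'f5'}
l_chain = {'B', 'f2', 'f3'}
print(legal_fusion_graph([j_chain, k_chain, l_chain]))   # True: nested
print(legal_fusion_graph([{'f1', 'f2'}, {'f2', 'f3'}]))  # False: partial overlap
```

The second call shows the illegal case: two chains sharing f2 while neither contains the other.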
Fusion graphs help us visualize the structure of the fused loops and find further fusion opportunities. If we can find a set of potential fusion edges that, when converted to fusion edges, does not lead to partially overlapping scopes of fusion chains, then we can perform the corresponding loop fusions and reduce the size of some arrays. For example, the i-loops between A and f1 in Figure 3.2(f) can be further fused, and array A would be reduced to a scalar. If converting all potential fusion edges in a fusion graph to fusion edges does not make the fusion graph illegal, then we can completely fuse all the loops and achieve optimal memory usage. But for many fusion graphs of real-life loop configurations (including the ones in Figure 3.2), this does not hold. Instead, potential fusion edges may be mutually prohibitive; fusing one loop could prevent the fusion of another. In Figure 3.2(e), fusing the j-loops between f1 and f4 would disallow the fusion of the k-loops between f3 and f4.
Although a fusion graph specifies which loops are fused, it does not fully determine the permutations of the loops and the ordering of the loop nests. As we will see in Section 3.5, under dynamic memory allocation, reordering loop nests could alter memory usage without changing array sizes.
3.2 Preliminaries
So far, we have been describing the fusion between a node and its parent by the set of fused loops (or their loop indices, such as {i, j}). But in order to compare loop fusion configurations for a subtree, it is desirable to include information about the relative scopes of the fused loops in the subtree.

Scope and fusion scope of a loop. The scope of a loop of index i in a subtree rooted at v, denoted scope(i, v), is defined in the usual sense as the set of nodes in the subtree that the fused loop spans. That is, if the i-loop is fused, scope(i, v) = scope(c) ∩ subtree(v), where c is a fusion chain for the i-loop with v ∈ scope(c). If the i-loop of v is not fused, then scope(i, v) = ∅. We also define the fusion scope of an i-loop in a subtree rooted at v as fscope(i, v) = scope(i, v) if the i-loop is fused between v and its parent; otherwise fscope(i, v) = ∅. As an example, for the fusion graph in Figure 3.2(e), scope(j, f3) = {B, f2, f3}, but fscope(j, f3) = ∅.
Indexset sequence. To describe the relative scopes of a set of fused loops,
we introduce the notion of an indexset sequence, which is defined as an ordered list
of disjoint, non-empty sets of loop indices. For example, f = ({i, k}, {j}) is an
indexset sequence. For simplicity, we write each indexset in an indexset sequence
as a string. Thus, f is written as (ik, j). Let g and g' be indexset sequences. We
denote by |g| the number of indexsets in g, by g[r] the r-th indexset in g, and by
Set(g) the union of all indexsets in g, i.e., Set(g) = ∪_{1≤r≤|g|} g[r]. For instance,
|f| = 2, f[1] = {i, k}, and Set(f) = Set((j, i, k)) = {i, j, k}. We say that g' is a
prefix of g if |g'| ≤ |g|, g'[|g'|] ⊆ g[|g'|], and for all 1 ≤ r < |g'|, g'[r] = g[r]. We
write this relation as prefix(g', g). So, (), (i), (k), (ik), and (ik, j) are prefixes of f,
but (i, j) is not. The concatenation of g and an indexset x, denoted g + x, is defined
as the indexset sequence g'' such that if x ≠ ∅, then |g''| = |g| + 1, g''[|g''|] = x,
and for all 1 ≤ r ≤ |g|, g''[r] = g[r]; otherwise, g'' = g.
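These operations can be rendered in a few lines of Python. This is our own illustration, not part of the dissertation; an indexset sequence is represented as a tuple of frozensets, and the helper names (seq_set, is_prefix, concat) are invented here.

```python
# Our own Python sketch of the indexset-sequence operations defined above.

def seq_set(g):
    """Set(g): the union of all indexsets in g."""
    out = set()
    for x in g:
        out |= x
    return out

def is_prefix(gp, g):
    """prefix(g', g): |g'| <= |g|, the last indexset of g' is a subset of
    the corresponding indexset of g, and all earlier indexsets are equal."""
    if len(gp) > len(g):
        return False
    if not gp:
        return True
    if not gp[-1] <= g[len(gp) - 1]:
        return False
    return all(gp[r] == g[r] for r in range(len(gp) - 1))

def concat(g, x):
    """g + x: append indexset x to g unless x is empty."""
    return g + (frozenset(x),) if x else g

f = (frozenset('ik'), frozenset('j'))                       # f = (ik, j)
assert seq_set(f) == {'i', 'j', 'k'}
assert is_prefix((frozenset('ik'),), f)                     # (ik) is a prefix of f
assert not is_prefix((frozenset('i'), frozenset('j')), f)   # (i, j) is not
```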
Fusion. We use the notion of an indexset sequence to define a fusion. Intuitively,
the loops fused between a node and its parent are ranked by their fusion scopes in
the subtree from largest to smallest; two loops with the same fusion scope have the
same rank (i.e., are in the same indexset). For example, in Figure 3.2(f), the fusion
between f2 and f3 is (jkl) and the fusion between f4 and f5 is (j, k) (because the
fused j-loop covers two more nodes, A and f1). Formally, a fusion between a node v
and v.parent is an indexset sequence f such that

1. Set(f) ⊆ v.indices ∩ v.parent.indices,

2. for all i ∈ Set(f), the i-loop is fused between v and v.parent, and

3. for all i ∈ f[r] and i' ∈ f[r'],

(a) r = r' iff fscope(i, v) = fscope(i', v), and

(b) r < r' iff fscope(i, v) ⊃ fscope(i', v).
Nesting. Similarly, a nesting of the loops at a node v can be defined as an
indexset sequence. Intuitively, the loops at a node are ranked by their scopes in the
subtree; two loops have the same rank (i.e., are in the same indexset) if they have
the same scope. For example, in Figure 3.2(e), the loop nesting at f2 is (kl, j), at f4
is (jk), and at B is (jkl). Formally, a nesting of the loops at a node v is an indexset
sequence h such that

1. Set(h) = v.indices and

2. for all i ∈ h[r] and i' ∈ h[r'],

(a) r = r' iff scope(i, v) = scope(i', v), and

(b) r < r' iff scope(i, v) ⊃ scope(i', v).

By definition, the loop nesting at a leaf node v must be (v.indices) because all loops
at v have empty scope.
Legal fusion. A legal fusion graph (corresponding to a loop fusion configuration)
for an expression tree T can be built up in a bottom-up manner by extending and
merging legal fusion graphs for the subtrees of T. For a given node v, the nesting h
at v summarizes the fusion graph for the subtree rooted at v and determines what
fusions are allowed between v and its parent. A fusion f is legal for a nesting h at v
if prefix(f, h) and Set(f) ⊆ v.parent.indices. This is because, to keep the fusion graph
legal, loops with larger scopes must be fused before fusing those with smaller scopes,
and only loops common to both v and its parent may be fused. For example, consider
the fusion graph for the subtree rooted at f3 in Figure 3.2(e). Since the nesting at f2
is (kl, j) and f3.indices = {j, k, l}, the legal fusions between f2 and f3 are (), (k), (l),
(kl), and (kl, j). Notice that all legal fusions for a node v are prefixes of a maximal
legal fusion, which can be expressed as

MaxFusion(h, v) = max{f | prefix(f, h) and Set(f) ⊆ v.parent.indices}

where h is the nesting at v. In Figure 3.2(e), the maximal legal fusion for C is (kl),
and for f2 is (kl, j).
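Because all legal fusions are prefixes of one another, MaxFusion can be computed greedily. The sketch below is our own (not the dissertation's), writing indexsets as strings, e.g. ('kl', 'j') for (kl, j):

```python
# Our own sketch of MaxFusion: the longest prefix of the nesting h whose
# indices all lie in the parent's index set.

def max_fusion(h, parent_indices):
    m = []
    for x in h:
        inside = set(x) & set(parent_indices)
        if inside:
            m.append(frozenset(inside))
        if inside != set(x):    # a partially covered indexset ends the prefix
            break
    return tuple(m)

# Figure 3.2(e): nesting (kl, j) at f2 with f3.indices = {j, k, l}
assert max_fusion(('kl', 'j'), 'jkl') == (frozenset('kl'), frozenset('j'))
# nesting (kl, j) at f3 with f4.indices = {j, k} gives (k)
assert max_fusion(('kl', 'j'), 'jk') == (frozenset('k'),)
```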
Resulting nesting. Let u be the parent of a node v. If v is the only child of u,
then the loop nesting at u as a result of a fusion f between u and v can be obtained
by the function

ExtNesting(f, u) = f + (u.indices − Set(f))

For example, in Figure 3.2(e), if the fusion between f2 and f3 is (kl), then the nesting
at f3 would be (kl, j).
Compatible nestings. Suppose v has a sibling v', f is the fusion between u and
v, and f' is the fusion between u and v'. For the fusion graph for the subtree rooted
at u (which is merged from those of v and v') to be legal, h = ExtNesting(f, u) and
h' = ExtNesting(f', u) must be compatible according to the condition:

for all i ∈ h[r] and j ∈ h[s],
if r < s and i ∈ h'[r'] and j ∈ h'[s'], then r' ≤ s'.

This requirement ensures that an i-loop that has a larger scope than a j-loop in one
subtree will not have a smaller scope than the j-loop in the other subtree. If h and h'
are compatible, the resulting loop nesting at u (as merged from h and h') is h'' such that

for all i ∈ h''[r''] and j ∈ h''[s''],
if i ∈ h[r], i ∈ h'[r'], j ∈ h[s], and j ∈ h'[s'], then

1. r'' = s'' implies r = s and r' = s', and

2. r'' < s'' implies r ≤ s and r' ≤ s'.

Effectively, the loops at u are re-ranked by their combined scopes in the two subtrees
to form h''. As an example, in Figure 3.2(e), if the fusion between f1 and f4 is f = (j)
and the fusion between f3 and f4 is f' = (k), then h = ExtNesting(f, f4) = (j, k) and
h' = ExtNesting(f', f4) = (k, j) would be incompatible. But if f is changed to (),
then h = ExtNesting(f, f4) = (jk) would be compatible with h', and the resulting
nesting at f4 would be (k, j). A procedure for checking whether h and h' are compatible
and forming h'' from h and h' is provided in Section 3.3.
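The compatibility check and the merge can be done in a single left-to-right scan that repeatedly intersects the current indexsets of the two nestings. The following is our own Python sketch of that idea (the dissertation's own MergeNesting pseudocode appears in Figure 3.5); None signals incompatibility:

```python
# Our own sketch of checking compatibility of two nestings and forming
# the merged nesting h''. Nestings are tuples of index sets.

def merge_nesting(h1, h2):
    g = []
    i = j = 0
    x1, x2 = set(), set()
    while i < len(h1) or j < len(h2) or x1 or x2:
        if not x1 and i < len(h1):
            x1 = set(h1[i]); i += 1
        if not x2 and j < len(h2):
            x2 = set(h2[j]); j += 1
        y = x1 & x2
        if not y:
            return None                    # h1 and h2 are incompatible
        g.append(frozenset(y))             # loops with the same combined rank
        x1, x2 = x1 - x2, x2 - x1
    return tuple(g)

# The example from Figure 3.2(e):
assert merge_nesting(({'j'}, {'k'}), ({'k'}, {'j'})) is None
assert merge_nesting(({'j', 'k'},), ({'k'}, {'j'})) == \
       (frozenset({'k'}), frozenset({'j'}))
```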
The “more-constraining” relation on nestings. A nesting h at a node v is
said to be more or equally constraining than another nesting h' at the same node,
denoted h ⊑ h', if for every legal fusion graph G for T in which the nesting at v is h,
there exists a legal fusion graph G' for T in which the nesting at v is h' such that
the subgraphs of G and G' induced by T − subtree(v) are identical. In other words,
h ⊑ h' means that any loop fusion configuration for the rest of the expression tree
that works with h also works with h'. This relation allows us to do effective pruning
among the large number of loop fusion configurations for a subtree in Section 3.3. It
can be proved that the necessary and sufficient condition for h ⊑ h' is that

for all i ∈ m[r] and j ∈ m[s], there exist r', s' such that

1. i ∈ m'[r'] and j ∈ m'[s'],

2. r = s implies r' = s', and

3. r < s implies r' ≤ s'

where m = MaxFusion(h, v) and m' = MaxFusion(h', v). Comparing the nesting at
f3 between Figure 3.2(e) and (f), the nesting (kl, j) in (e) is more constraining than
the nesting (jkl) in (f). A procedure for determining whether h ⊑ h' is given in
Section 3.3.
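The condition above can be tested by consuming the indexsets of m' left to right while checking that each indexset of m fits inside the current one. This is our own sketch, assuming the test is applied to the maximal fusions m and m' as described above:

```python
# Our own sketch of the h ⊑ h' test, applied to m = MaxFusion(h, v)
# and m' = MaxFusion(h', v).

def more_constraining(m, m2):
    j = 0
    x2 = set()
    for x in m:
        if not x2:
            if j >= len(m2):
                return False
            x2 = set(m2[j]); j += 1
        if not set(x) <= x2:
            return False
        x2 -= set(x)
    return True

# At f3, (kl, j) has maximal fusion (k) and (jkl) has maximal fusion (jk):
assert more_constraining(({'k'},), ({'j', 'k'},))      # so (kl, j) ⊑ (jkl)
assert not more_constraining(({'j', 'k'},), ({'k'},))
```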
3.3 Static Memory Allocation

Under the static memory allocation model, since all the arrays in a program exist
during the entire computation, the memory usage of a loop fusion configuration is
simply the sum of the sizes of all the arrays (including those reduced to scalars).
Figures 3.3 to 3.6 show a bottom-up, dynamic programming algorithm for finding
a memory-optimal loop fusion configuration for a given expression tree T. For each
node v in T, it computes a set of solutions v.solns for the subtree rooted at v. Each
solution s in v.solns represents a loop fusion configuration for the subtree rooted at
v and contains the following information: the loop nesting s.nesting at v, the
fusion s.fusion between v and its parent, the memory usage s.mem so far, and the
pointers s.src1 and s.src2 to the corresponding solutions for the children of v.
The set of solutions v.solns is obtained by the following steps. First, if v is a leaf
node, initialize the solution set to contain a single solution using InitSolns. Otherwise,
take the solution set from a child v.child1 of v and, if v has two children, merge
it (using MergeSolns) with the compatible solutions from the other child v.child2.
Then, prune the solution set to remove the inferior solutions using PruneSolns. A
solution s is inferior to another unpruned solution s' if s.nesting is more or equally
constraining than s'.nesting and s does not use less memory than s'. Next, extend
the solution set by considering all possible legal fusions between v and its parent (see
ExtendSolns). The size of array v is added to the memory usage by AddMemUsage.
Inferior solutions are also removed.

Although the complexity of the algorithm is exponential in the number of index
variables and the number of solutions could in theory grow exponentially with the
size of the expression tree, the number of index variables in practical applications is
usually small and there is indication that the pruning is effective in keeping the size
of the solution set at each node small.

The algorithm assumes the leaf nodes may be freely fused with their parents and
the root node array must be available in its entirety at the end of the computation.
If these assumptions do not hold, the InitFusible procedure can easily be modified
to restrict or expand the allowable fusions for those nodes.
MinMemFusion (T):
    InitFusible (T)
    foreach node v in some bottom-up traversal of T
        if v.nchildren = 0 then
            S1 = InitSolns (v)
        else
            S1 = v.child1.solns
            if v.nchildren = 2 then
                S1 = MergeSolns (S1, v.child2.solns)
            S1 = PruneSolns (S1, v)
        v.solns = ExtendSolns (S1, v)
    T.root.optsoln = the single element in T.root.solns
    foreach node v in some top-down traversal of T
        s = v.optsoln
        v.optfusion = s.fusion
        s1 = s.src1
        if v.nchildren = 1 then
            v.child1.optsoln = s1
        else
            v.child1.optsoln = s1.src1
            v.child2.optsoln = s1.src2

InitFusible (T):
    foreach v ∈ T
        if v = T.root then
            v.fusible = ∅
        else
            v.fusible = v.indices ∩ v.parent.indices

InitSolns (v):
    s.nesting = (v.fusible)
    InitMemUsage (s)
    return {s}

Figure 3.3: Algorithm for static memory allocation.
MergeSolns (S1, S2):
    S = ∅
    foreach s1 ∈ S1
        foreach s2 ∈ S2
            s.nesting = MergeNesting (s1.nesting, s2.nesting)
            if s.nesting ≠ () then    // if s1 and s2 are compatible
                s.src1 = s1
                s.src2 = s2
                MergeMemUsage (s1, s2, s)
                AddSoln (s, S)
    return S

PruneSolns (S1, v):
    S = ∅
    foreach s1 ∈ S1
        s.nesting = MaxFusion (s1.nesting, v)
        AddSoln (s, S)
    return S

ExtendSolns (S1, v):
    S = ∅
    foreach s1 ∈ S1
        foreach prefix f of s1.nesting
            s.fusion = f
            s.nesting = ExtNesting (f, v.parent)
            s.src1 = s1
            size = FusedSize (v, f)
            AddMemUsage (v, f, size, s1, s)
            AddSoln (s, S)
    return S

AddSoln (s, S):
    foreach s' ∈ S
        if Inferior (s, s') then
            return
        else if Inferior (s', s) then
            S = S − {s'}
    S = S ∪ {s}

Figure 3.4: Algorithm for static memory allocation. (cont.)
MergeNesting (h, h'):
    g = ()
    r = r' = 1
    x = x' = ∅
    while r ≤ |h| or r' ≤ |h'|
        if x = ∅ then x = h[r++]
        if x' = ∅ then x' = h'[r'++]
        y = x ∩ x'
        if y = ∅ then
            return ()    // h and h' are incompatible
        g = g + y
        y = x − x'
        x' = x' − x
        x = y
    end while
    return g    // h and h' are compatible

h ⊑ h':    // test if h is more/equally constraining than h'
    r' = 1
    x' = ∅
    for r = 1 to |h|
        if x' = ∅ then
            if r' > |h'| then
                return false
            x' = h'[r'++]
        if h[r] ⊈ x' then
            return false
        x' = x' − h[r]
    return true

InitMemUsage (s):
    s.mem = 0

AddMemUsage (v, f, size, s1, s):
    s.mem = s1.mem + size

MergeMemUsage (s1, s2, s):
    s.mem = s1.mem + s2.mem

Figure 3.5: Algorithm for static memory allocation. (cont.)
Inferior (s, s') = s.nesting ⊑ s'.nesting and s.mem ≥ s'.mem

FusedSize (v, f) = ∏ of N_i over i ∈ v.indices − {v.sumindex} − Set (f)

ExtNesting (f, u) = f + (u.indices − Set (f))

MaxFusion (h, v) = max{f | prefix (f, h) and Set (f) ⊆ v.parent.indices}

Set (f) = ∪_{1≤r≤|f|} f[r]

Figure 3.6: Algorithm for static memory allocation. (cont.)
v    line  src    nesting   fusion    ext-nest   memory usage            opt
A     1           (ij)      (ij)      (ij)       1                        ✓
B     2           (jkl)     (jkl)     (jkl)      1                        ✓
C     3           (kl)      (kl)      (kl,j)     1
      4           (kl)      (k)       (k,jl)     15                       ✓
      5           (kl)      (l)       (l,jk)     40
      6           (kl)      ()        (jkl)      600
f1    7     1     (ij)      (j)       (j,k)      1 + 1 = 2
      8     1     (ij)      ()        (jk)       1 + 100 = 101            ✓
f2    9     2,3   (kl,j)    (kl,j)    (kl,j)     (1 + 1) + 1 = 3
      10    2,4   (k,jl)    (k,jl)    (k,jl)     (1 + 15) + 1 = 17        ✓
      11    2,5   (l,jk)    (l,jk)    (l,jk)     (1 + 40) + 1 = 42
      12    2,6   (jkl)     (jkl)     (jkl)      (1 + 600) + 1 = 602
f3    13    10    (k,jl)    (k,j)     (k,j)      17 + 1 = 18              ✓
      14    12    (jkl)     (jk)      (jk)       602 + 1 = 603
f4    15    7,14  (j,k)     (j,k)     (j,k)      (2 + 603) + 1 = 606
      16    8,13  (k,j)     (k,j)     (k,j)      (101 + 18) + 1 = 120     ✓
      17    8,14  (jk)      (jk)      (jk)       (101 + 603) + 1 = 705
f5    18    16    (k,j)     ()                   120 + 40 = 160           ✓

Table 3.1: Solution sets for the subtrees in the example.
To illustrate how the algorithm works, consider again the empty fusion graph in
Figure 3.2(d) for the expression tree in Figure 3.1(c). Let N_i = 500, N_j = 100,
N_k = 40, and N_l = 15. There are 2³ = 8 different fusions between B and f2. Among
them, only the full fusion (jkl) is in B.solns because all other fusions result in more
constraining nestings and use more memory than the full fusion, and so are pruned.
However, this does not happen to the fusions between C and f2, since the resulting
nesting (kl, j) of the full fusion (kl) is not less constraining than those of the other 3
possible fusions. Then, solutions from B and C are merged together to form solutions
for f2. For example, when the two full-fusion solutions from B and C are merged, the
merged nesting for f2 is (kl, j), which can then be extended by full fusion (between
f2 and f3) to form a full-fusion solution for f3 that has a memory usage of only 3
scalars. Again, since this solution is not the least constraining one, other solutions
cannot be pruned. Actually, although this solution is optimal for the subtree rooted
at f3, it turns out to be non-optimal for the entire expression tree. Table 3.1 shows
the solution sets for all of the nodes. The "src" column contains the line numbers
of the corresponding solutions for the children. The "ext-nest" column shows the
resulting nesting for the parent. A check mark (✓) indicates that the solution forms
a part of an optimal solution for the entire expression tree. The fusion graph for the
optimal solution is shown in Figure 3.7(a).
Once an optimal solution is obtained, we can generate the corresponding fused
loop structure from it. The following procedure determines an evaluation order of the
nodes:

1. Initialize set P to contain the single node T.root and list L to an empty list.
(a) [fusion graph for the optimal solution]

(b) initialize f1
    for i
      for j
        A = genA(i,j)
        f1[j] += A
    initialize f5
    for k
      for l
        C[l] = genC(k,l)
      for j
        initialize f3
        for l
          B = genB(j,k,l)
          f2 = B × C[l]
          f3 += f2
        f4 = f1[j] × f3
        f5[k] += f4

Figure 3.7: An optimal solution for the example.
2. While P is not empty, remove from P a node v whose v.optfusion is maximal
among all nodes in P, insert v at the beginning of L, and add the children of v
(if any) to P.

3. When P is empty, L contains the evaluation order.

Putting the loops around the array evaluation statements is trivial. The initialization
of an array can be placed inside the innermost loop that contains both the evaluation
and use of the array. The optimal loop fusion configuration for the example expression
tree is shown in Figure 3.7(b).
3.4 Memory-Optimal Evaluation Order of Unfused Expression Trees

We now address the problem of finding an evaluation order of the nodes in a given
unfused expression tree that minimizes memory usage under the dynamic memory
allocation model. In this section, we do not consider any loop fusions between the
nodes in an expression tree. The solution developed in this section will be applied to
fused expression trees in Section 3.5.

In this problem, the expression tree must be evaluated in some bottom-up order,
i.e., the evaluation of a node cannot precede the evaluation of any of its children.
Since the nodes in the expression tree are not fused, any bottom-up traversal is a
legal evaluation order. The nodes of the expression tree are large data objects whose
sizes are given. The total size of the data objects could be so large that they cannot
all fit into memory at the same time. Space for the data objects has to be allocated
and deallocated dynamically. Due to the parent-child dependence relation, a data
object cannot be deallocated until its parent node data object has been evaluated.
The objective is to minimize the maximum memory usage during the evaluation of
the entire expression tree. This problem could also arise in other applications such
as register allocation, database query optimization, and data mining.
As an example of the unfused memory usage optimization problem, consider the
expression tree shown in Figure 3.8. The size of each data object is shown alongside
the corresponding node label. Before evaluating a data object, space for it must be
allocated. That space can be deallocated only after the evaluation of its parent is
complete. There are many allowable evaluation orders of the nodes. One of them
is the post-order traversal (A, B, C, D, E, F, G, H, I) of the expression tree. It has a
                    I : 16
            F : 15          H : 5
        B : 3    E : 16         G : 25
      A : 20       D : 9
                 C : 30
Figure 3.8: An example unfused expression tree.
maximum memory usage of 45 units. This occurs during the evaluation of H, when
F, G, and H are in memory. Other evaluation orders may use more or less memory.
Finding the optimal order (C, D, G, H, A, B, E, F, I), which uses 39 units of
memory, is non-trivial.

Section 3.4.1 formally defines the unfused memory usage optimization problem
and makes some observations about it. Section 3.4.2 presents an efficient algorithm
that solves the problem in O(n²) time for an n-node expression tree. We show the
correctness of the algorithm in Section 3.4.3.
3.4.1 Problem Statement

The problem addressed is the optimization of memory usage in the evaluation of
a given expression tree, whose nodes correspond to large data objects of various sizes.
Each data object depends on all its children (if any), and thus can be evaluated only
after all its children have been evaluated. The goal is to find an evaluation order of
the nodes that uses the least amount of memory. Since an evaluation order is also a
traversal of the nodes, we will use these two terms interchangeably. Space for data
objects is dynamically allocated or deallocated under the following assumptions:

1. Each object is allocated or deallocated in its entirety.

2. Leaf node objects are created or read in as needed.

3. Internal node objects must be allocated before their evaluation begins.

4. Each object must remain in memory until the evaluation of its parent is completed.
We define the problem formally as follows:

Given a tree T and a size v.size for each node v ∈ T, find a computation
of T that uses the least memory, i.e., an ordering P = (v₁, ..., vₙ) of
the nodes in T, where n is the number of nodes in T, such that

1. for all vᵢ, vⱼ, if vᵢ is the parent of vⱼ, then i > j; and

2. max_{vᵢ∈P} himem(vᵢ, P) is minimized, where

himem(vᵢ, P) = lomem(vᵢ₋₁, P) + vᵢ.size
lomem(vᵢ, P) = himem(vᵢ, P) − Σ_{vⱼ : vⱼ is a child of vᵢ} vⱼ.size

with lomem(v₀, P) = 0. Here, himem(vᵢ, P) is the memory usage during the evaluation
of vᵢ in the traversal P, and lomem(vᵢ, P) is the memory usage upon completion of
the same evaluation. In general, before evaluating vᵢ, we need to allocate space for vᵢ.
After vᵢ is evaluated, the space allocated to all its children may be released. As an
illustration, consider the post-order traversal P = (A, B, C, D, E, F, G, H, I) of the expression
(a) Post-order traversal        (b) A better traversal         (c) The optimal traversal

Node  himem  lomem              Node  himem  lomem              Node  himem  lomem
A       20     20               G       25     25               C       30     30
B       23      3               H       30      5               D       39      9
C       33     33               C       35     35               G       34     34
D       42     12               D       44     14               H       39     14
E       28     19               E       30     21               A       34     34
F       34     15               A       41     41               B       37     17
G       40     40               B       44     24               E       33     24
H       45     20               F       39     20               F       39     20
I       36     16               I       36     16               I       36     16
max     45                      max     44                      max     39

Table 3.2: Memory usage of three different traversals of the expression tree in Figure 3.8.
tree shown in Figure 3.8. During and after the evaluation of A, A is in memory. So,
himem(A, P) = lomem(A, P) = A.size = 20. To evaluate B, we need to allocate
space for B, thus himem(B, P) = lomem(A, P) + B.size = 23. After B is obtained, A
can be deallocated, giving lomem(B, P) = himem(B, P) − A.size = 3. The memory
usage for the rest of the nodes can be determined in a similar fashion and is shown
in Table 3.2(a).
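The himem/lomem recurrences are easy to check mechanically. The following is our own sketch (not part of the dissertation), with the tree of Figure 3.8 transcribed into two dictionaries:

```python
# Our own check of the himem/lomem recurrences on the tree of Figure 3.8.

sizes = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
         'F': 15, 'G': 25, 'H': 5, 'I': 16}
children = {'B': ['A'], 'D': ['C'], 'E': ['D'], 'F': ['B', 'E'],
            'H': ['G'], 'I': ['F', 'H']}

def max_memory(order):
    lo = 0      # lomem of the previously evaluated node
    peak = 0
    for v in order:
        hi = lo + sizes[v]                                    # himem(v, P)
        lo = hi - sum(sizes[c] for c in children.get(v, []))  # lomem(v, P)
        peak = max(peak, hi)
    return peak

assert max_memory('ABCDEFGHI') == 45   # post-order, Table 3.2(a)
assert max_memory('GHCDEABFI') == 44   # Table 3.2(b)
assert max_memory('CDGHABEFI') == 39   # optimal, Table 3.2(c)
```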
However, the post-order traversal of the given expression tree is not optimal in
memory usage. For this example, none of the traversals that visit all nodes in one
subtree before visiting another subtree is optimal. For the given expression tree,
there are four such traversals. They are (A, B, C, D, E, F, G, H, I),
(C, D, E, A, B, F, G, H, I), (G, H, A, B, C, D, E, F, I), and (G, H, C, D, E, A, B, F, I).
If we follow the traditional wisdom of visiting the subtree that uses more memory first,
we obtain the best of the four traversals, which is (G, H, C, D, E, A, B, F, I). Its
overall memory usage is
44 units, as shown in Table 3.2(b), and is not optimal. The optimal traversal is
(C, D, G, H, A, B, E, F, I), which uses 39 units of memory (see Table 3.2(c)). Notice
that it 'jumps' back and forth between the subtrees. Therefore, any algorithm that
only considers traversals that visit subtrees one after another may not produce an
optimal solution.
One possible approach to the unfused memory usage optimization problem is to
apply dynamic programming on an expression tree as follows. Each traversal can
be viewed as going through a sequence of configurations, each configuration being
a set of nodes that have been evaluated (which can be represented more compactly
as a smaller set of nodes in which none is an ancestor or descendant of another).
In other words, the set of nodes in a prefix of a traversal forms a configuration.
Common configurations in different traversals form overlapping sub-problems. A
configuration can be formed in many ways, corresponding to different orderings of
the nodes. The optimal way to form a configuration Z containing k nodes can be
obtained by minimizing over all valid configurations that are (k−1)-subsets of Z. By
finding the optimal costs for all configurations in order of increasing number of
nodes, we get an optimal traversal of the expression tree. However, this approach is
inefficient in that the number of configurations is exponential in the number of nodes.

The unfused memory usage optimization problem has an interesting property: an
expression tree or a subtree may have more than one optimal traversal. For example,
for the subtree rooted at F, the traversals (C, D, E, A, B, F) and (C, D, A, B, E, F)
both use the least memory, 39 units. One might attempt to take two optimal
subtree traversals, one from each child of a node X, merge them together optimally,
and then append X to form a traversal for X. Although the optimal merge can
be evaluated in O(nm) time using dynamic programming (where n and m are the
lengths of the two subtree traversals), the resulting traversal may not be optimal for
X. Continuing with the above example, if we merge together (C, D, E, A, B, F) and
(G, H) (which are optimal for the subtrees rooted at F and H, respectively) and then
append I, the best we can get is a sub-optimal traversal (G, H, C, D, E, A, B, F, I)
that uses 44 units of memory (see Table 3.2(b)). However, the other optimal traversal
(C, D, A, B, E, F) for the subtree rooted at F can be merged with (G, H) to form
(C, D, G, H, A, B, E, F, I) (with I appended), which is an optimal traversal of the
entire expression tree. Hence, some optimal traversals of a subtree may not appear as
subsequences in any optimal traversal of the entire expression tree. In other words,
locally optimal traversals may not be globally optimal. In the next section, we present
an efficient algorithm that finds traversals which are not only locally optimal but also
globally optimal.
3.4.2 An Efficient Algorithm

We now present an efficient divide-and-conquer algorithm that, given an expression
tree whose nodes are large data objects, finds an evaluation order of the tree that
minimizes the memory usage. For each node in the expression tree, it computes an
optimal traversal for the subtree rooted at that node. The optimal subtree traversal
that it computes has a special property: it is not only locally optimal for the subtree,
but also globally optimal in the sense that it can be merged together with globally
optimal traversals for other subtrees to form a traversal for a larger subtree
which is also globally optimal. As we have seen in Section 3.4.1, not all locally
optimal traversals for a subtree can be used to form an optimal traversal for a larger
tree.
The algorithm stores a traversal not as an ordered list of nodes, but as an ordered
list of indivisible units called elements. Each element contains an ordered list of
nodes with the property that there necessarily exists some globally optimal traversal
of the entire tree wherein this sequence appears undivided. Therefore, as we show
later, inserting any node in between the nodes of an element does not lower the total
memory usage. An element initially contains a single node. But as the algorithm
goes up the tree merging traversals together and appending new nodes to them,
elements may be appended together to form new elements containing a larger number
of nodes. Moreover, the order of indivisible units in a traversal stays invariant, i.e.,
the indivisible units must appear in the same order in some optimal traversal of the
entire expression tree. This means that indivisible units can be treated as a whole and
we only need to consider the relative order of indivisible units from different subtrees.

Each element (or indivisible unit) in a traversal is a (nodelist, hi, lo) triple, where
nodelist is an ordered list of nodes, hi is the maximum memory usage during the
evaluation of the nodes in nodelist, and lo is the memory usage after those nodes are
evaluated. Using the terminology from Section 3.4.1, hi is the highest himem among
the nodes in nodelist, and lo is the lomem of the last node in nodelist. The algorithm
always maintains the elements of a traversal in decreasing hi and increasing lo order,
which implies an order of decreasing hi−lo difference. In Section 3.4.3, we prove that
arranging the indivisible units in this order minimizes memory usage.
MinMemTraversal (T):
    foreach node v in some bottom-up traversal of T
        v.seq = ()    // an empty list
        foreach child u of v
            v.seq = MergeSeq (v.seq, u.seq)
        if |v.seq| > 0 then    // |x| is the length of x
            base = v.seq[|v.seq|].lo
        else
            base = 0
        AppendSeq (v.seq, (v), v.size + base, v.size)
    nodelist = ()
    for i = 1 to |T.root.seq|
        nodelist = nodelist + T.root.seq[i].nodelist    // + is the concatenation operator
    return nodelist    // memory usage is T.root.seq[1].hi

MergeSeq (S1, S2):
    S = ()
    i = j = 1
    base1 = base2 = 0
    while i ≤ |S1| or j ≤ |S2|
        if j > |S2| or (i ≤ |S1| and S1[i].hi − S1[i].lo ≥ S2[j].hi − S2[j].lo) then
            AppendSeq (S, S1[i].nodelist, S1[i].hi + base1, S1[i].lo + base1)
            base2 = S1[i].lo
            i++
        else
            AppendSeq (S, S2[j].nodelist, S2[j].hi + base2, S2[j].lo + base2)
            base1 = S2[j].lo
            j++
    end while
    return S

AppendSeq (S, nodelist, hi, lo):
    E = (nodelist, hi, lo)    // new element to append to S
    i = |S|
    while i ≥ 1 and (E.hi ≥ S[i].hi or E.lo ≤ S[i].lo)
        E = (S[i].nodelist + E.nodelist, max(S[i].hi, E.hi), E.lo)    // S[i] is combined into E
        remove S[i] from S
        i−−
    end while
    S = S + E    // |S| is now i + 1

Figure 3.9: Procedure for finding a memory-optimal traversal of an expression tree.
Figure 3.9 shows the algorithm. The input to the algorithm (the MinMemTraversal
procedure) is an expression tree T, in which each node v has a field v.size
denoting the size of its data object. The procedure performs a bottom-up traversal
of the tree and, for each node v, computes an optimal traversal v.seq for the subtree
rooted at v. The optimal traversal v.seq is obtained by optimally merging together
the optimal traversals u.seq from each child u of v, and then appending v. At the end,
the procedure returns a concatenation of all the nodelists in T.root.seq as the optimal
traversal for the given expression tree. The memory usage of the optimal traversal is
T.root.seq[1].hi.

The MergeSeq procedure merges two given traversals S1 and S2 optimally and
returns the merged result S. S1 and S2 are subtree traversals of two children of
the same parent. The optimal merge is performed in a fashion similar to mergesort.
Elements from S1 and S2 are scanned sequentially and appended to S in
order of decreasing hi−lo difference. This order guarantees that the indivisible units
are arranged to minimize memory usage. Since S1 and S2 are formed independently,
the hi and lo values in the elements from S1 and S2 must be adjusted before they can
be appended to S. The amount of adjustment for an element from S1 (S2) equals the
lo value of the last merged element from S2 (S1), which is kept in base1 (base2).

The AppendSeq procedure appends a new element specified by nodelist, hi, and
lo to the given traversal S. Before the new element E is appended to S, it is combined
with elements at the end of S whose hi is not higher than E.hi or whose lo is not lower
than E.lo. The combined element has the concatenated nodelist and the highest hi
but the original E.lo. Elements are combined to form larger indivisible units.
Node v    Optimal traversal v.seq
A         ((A, 20, 20))
B         ((AB, 23, 3))
C         ((C, 30, 30))
D         ((CD, 39, 9))
E         ((CD, 39, 9), (E, 25, 16))
F         ((CD, 39, 9), (ABEF, 34, 15))
G         ((G, 25, 25))
H         ((GH, 30, 5))
I         ((CDGHABEFI, 39, 16))

(a) The expression tree in Figure 3.8.    (b) Optimal traversals for subtrees.

Figure 3.10: Optimal traversals for the subtrees in the expression tree in Figure 3.8.
To illustrate how the algorithm works, consider the expression tree shown in
Figure 3.8 and reproduced in Figure 3.10(a). We visit the nodes in a bottom-up
order. Since A has no children, A.seq = ((A, 20, 20)) (for clarity, we write nodelists
in a sequence as strings). To form B.seq, we take A.seq and append a new element
(B, 3 + 20, 3) to it. The AppendSeq procedure combines the two elements into one,
leaving B.seq = ((AB, 23, 3)). Here, A and B form an indivisible unit, implying that
B must follow A in some optimal traversal of the entire expression tree. Similarly, we
get E.seq = ((CD, 39, 9), (E, 25, 16)). For node F, which has two children B and E,
we merge B.seq and E.seq in order of decreasing hi−lo difference. So, the elements
merged are first (CD, 39, 9), then (AB, 23+9, 3+9), and finally (E, 25+3, 16+3), with
the adjustments shown. They are the three elements in F.seq after the merge, as no
elements are combined so far. Then, we append to F.seq a new element (F, 15+19, 15)
for the root of the subtree. The new element is combined with the last two elements
in F.seq. Hence, the final content of F.seq is ((CD, 39, 9), (ABEF, 34, 15)), which
consists of two indivisible units. The optimal traversals for the other nodes are
computed in the same way and are shown in Figure 3.10(b). At the end, the algorithm
returns the optimal traversal (C, D, G, H, A, B, E, F, I) for the entire expression tree
(see Table 3.2(c)).
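The whole procedure of Figure 3.9 is short enough to render as executable code. The following Python is our own sketch, not the dissertation's: elements are (nodelist, hi, lo) tuples, nodelists are strings, and the tie-breaking in the merge comparison is our choice.

```python
# Executable sketch (ours) of MinMemTraversal / MergeSeq / AppendSeq
# from Figure 3.9, applied to the tree of Figure 3.8.

def append_seq(S, nodelist, hi, lo):
    E = (nodelist, hi, lo)
    # combine trailing elements whose hi is not higher, or lo not lower, than E's
    while S and (E[1] >= S[-1][1] or E[2] <= S[-1][2]):
        last = S.pop()
        E = (last[0] + E[0], max(last[1], E[1]), E[2])
    S.append(E)

def merge_seq(S1, S2):
    S, i, j, base1, base2 = [], 0, 0, 0, 0
    while i < len(S1) or j < len(S2):
        take1 = j >= len(S2) or (i < len(S1) and
                 S1[i][1] - S1[i][2] >= S2[j][1] - S2[j][2])
        if take1:                              # larger hi-lo difference first
            append_seq(S, S1[i][0], S1[i][1] + base1, S1[i][2] + base1)
            base2, i = S1[i][2], i + 1
        else:
            append_seq(S, S2[j][0], S2[j][1] + base2, S2[j][2] + base2)
            base1, j = S2[j][2], j + 1
    return S

def min_mem_traversal(sizes, children, root):
    seq = {}
    def visit(v):
        s = []
        for u in children.get(v, []):
            visit(u)
            s = merge_seq(s, seq[u])
        base = s[-1][2] if s else 0
        append_seq(s, v, sizes[v] + base, sizes[v])
        seq[v] = s
    visit(root)
    order = ''.join(e[0] for e in seq[root])
    return order, seq[root][0][1]              # traversal and its memory usage

sizes = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
         'F': 15, 'G': 25, 'H': 5, 'I': 16}
children = {'B': ['A'], 'D': ['C'], 'E': ['D'], 'F': ['B', 'E'],
            'H': ['G'], 'I': ['F', 'H']}
assert min_mem_traversal(sizes, children, 'I') == ('CDGHABEFI', 39)
```

Running it reproduces the traversal and the 39-unit memory usage of Table 3.2(c), and the intermediate seq values match Figure 3.10(b).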
The time complexity of this algorithm is O(n²) for an n-node expression tree
because the processing for each node v takes O(m) time, where m is the number of
nodes in the subtree rooted at v. Although the AppendSeq procedure has a while-loop
in it, the total number of iterations of the loop during the processing of a node
cannot exceed m. On average, the performance of the algorithm could be close to
O(n) since the frequent combinations of the elements in a traversal keep the traversals
relatively short. Another feature of this algorithm is that the traversal it finds for a
subtree T' is not only optimal for T' but must also appear as a subsequence in some
optimal traversal for any larger tree that contains T' as a subtree. For example, E.seq
is a subsequence in F.seq, which is in turn a subsequence in I.seq (see Figure 3.10(b)).
3.4.3 Correctness of the Algorithm
We now show the correctness of the algorithm. The first lemma establishes some
important properties about the elements in an ordered list v.seq that represents a
traversal.
Lemma 1 Let v be any node in an expression tree, S = v.seq, and P be the traversal
represented by S of the subtree rooted at v, i.e., P = S[1].nodelist + ... + S[|S|].nodelist.
The algorithm maintains the following invariants:

For all 1 ≤ i ≤ |S|, let S[i].nodelist = (v_1, v_2, ..., v_n) and v_m be the
last node in S[i].nodelist that has the maximum himem value, i.e., for all
k < m, himem(v_k, P) ≤ himem(v_m, P) and for all k > m, himem(v_k, P) <
himem(v_m, P). Then, we have,

1. S[i].hi = himem(v_m, P),
2. S[i].lo = lomem(v_n, P),
3. for all m ≤ k ≤ n, lomem(v_k, P) ≥ lomem(v_n, P),
4. for all 1 ≤ j < i,
(a) for all 1 ≤ k ≤ n, S[j].hi ≥ himem(v_k, P),
(b) for all 1 ≤ k ≤ n, S[j].lo ≤ lomem(v_k, P),
(c) S[j].hi ≥ S[i].hi, and
(d) S[j].lo ≤ S[i].lo.
Proof
The above invariants are true by construction. ■
The second lemma asserts the ‘indivisibility’ of an indivisible unit by showing that
unrelated nodes inserted in between the nodes of an indivisible unit can always be
moved to the beginning or the end of the indivisible unit without increasing memory
usage. This lemma allows us to treat each traversal as a sequence of indivisible units
(each containing one or more nodes) instead of a list of the individual nodes.
Lemma 2 Let v be a node in an expression tree T, S = v.seq, and P be a traversal
of T in which the nodes from S[i].nodelist appear in the same order as they are in
S[i].nodelist, but not contiguously. Then, any nodes that are in between the nodes in
S[i].nodelist can always be moved to the beginning or the end of S[i].nodelist without
increasing memory usage, provided that none of the nodes that are in between the
nodes in S[i].nodelist are ancestors or descendants of any nodes in S[i].nodelist.
Proof
Let S[i].nodelist = (v_1, ..., v_n), v_0 be the node before v_1 in S, and v_m be the node
in S[i].nodelist such that for all k < m, himem(v_k, S) ≤ himem(v_m, S) and for
all k > m, himem(v_m, S) > himem(v_k, S). Let v'_1, ..., v'_b be the 'foreign' nodes,
i.e., the nodes that are in between the nodes in S[i].nodelist in P, with v'_1, ..., v'_a
(not necessarily contiguously) before and v'_{a+1}, ..., v'_b (not necessarily contigu-
ously) after v_m in P. Let Q be the traversal obtained from P by removing the
nodes in S[i].nodelist. We construct another traversal P' of T from P by moving
v'_1, ..., v'_a to the beginning of S[i].nodelist and v'_{a+1}, ..., v'_b to the end of S[i].nodelist.
In other words, we replace (v'_1, ..., v_1, ..., v'_a, ..., v_m, ..., v'_{a+1}, ..., v_n, ..., v'_b) in P
with (v'_1, ..., v'_a, v_1, ..., v_m, ..., v_n, v'_{a+1}, ..., v'_b) to form P'.
P and P' differ in memory usage only at the set of nodes {v_1, ..., v_n, v'_1, ..., v'_b}.
P' does not use more memory than P because:
1. The memory usage of v_m is the same in P and P' because himem(v_m, P) =
himem(v_m, P') = himem(v_m, S) + lomem(v'_a, Q).
2. For all 1 ≤ k ≤ n, since himem(v_k, S) ≤ himem(v_m, S), we have himem(v_k, P') =
himem(v_k, S) + lomem(v'_a, Q) ≤ himem(v_m, P) = himem(v_m, S) + lomem(v'_a, Q).
3. Since for all 1 ≤ k ≤ m, lomem(v_0, S) ≤ lomem(v_k, S) (by invariant 4(b)
in Lemma 1), we have, for all 1 ≤ j ≤ a, himem(v'_j, P') = himem(v'_j, Q) +
lomem(v_0, S) ≤ himem(v'_j, P).
4. Since for all m ≤ k ≤ n, lomem(v_k, S) ≥ lomem(v_n, S) (by invariant 3 in
Lemma 1), we have, for all a < j ≤ b, himem(v'_j, P') = himem(v'_j, Q) +
lomem(v_n, S) ≤ himem(v'_j, P).
Since the memory usage of any node in v_1, ..., v_n after moving the foreign nodes
cannot exceed that of v_m, which remains unchanged, and the memory usage of the
foreign nodes does not increase as a result of moving them, the overall maximum memory
usage cannot increase. ■
The next lemma deals with the ordering of indivisible units. It shows that
arranging indivisible units in the order of decreasing hi-lo difference minimizes memory
usage. This is because two indivisible units that are not in that order can be
interchanged in the merged traversal without increasing memory usage.
L em m a 3 Let v and v' be two nodes in an expression tree that are siblings of each
other, S = v.seq, and S' = v'.seq. Then, among all possible merges of S and S',
the merge that arranges the elements from S and S' in the order of decreasing hi-lo
difference uses the least memory.
Proof
Let M be a merge of S and S' that is not in the order of decreasing hi-lo difference.
Then there exists an adjacent pair of elements, one from each of S and S', that are
not in that order. Without loss of generality, we assume the first element is S'[j] from
S' and the second one is S[i] from S. Consider the merge M' obtained from M by
interchanging S'[j] and S[i]. To simplify the notation, let H_r = S[r].hi, L_r = S[r].lo,
H'_r = S'[r].hi, and L'_r = S'[r].lo. The memory usage of M and M' differs only at S'[j]
and S[i] and is compared in Figure 3.11.
element   himem             lomem
...
S'[j]     H'_j + L_{i-1}    L'_j + L_{i-1}
S[i]      H_i + L'_j        L_i + L'_j
...

(a) Sequence M

element   himem             lomem
...
S[i]      H_i + L'_{j-1}    L_i + L'_{j-1}
S'[j]     H'_j + L_i        L'_j + L_i
...

(b) Sequence M'

Figure 3.11: Memory usage comparison of two traversals in Lemma 3.
The memory usage of M at the two elements is max(H'_j + L_{i-1}, H_i + L'_j) while the
memory usage of M' at the same two elements is max(H_i + L'_{j-1}, H'_j + L_i). Since the two
elements are out of order, the hi-lo difference of S'[j] must be less than that of S[i],
i.e., H'_j - L'_j < H_i - L_i. This implies H_i + L'_j > H'_j + L_i. Invariant 4 in Lemma 1 gives
us L'_{j-1} ≤ L'_j, which implies H_i + L'_j ≥ H_i + L'_{j-1}. Thus, max(H'_j + L_{i-1}, H_i + L'_j) ≥
max(H'_j + L_i, H_i + L'_{j-1}). Therefore, M' cannot use more memory than M. By
switching all adjacent pairs in M that are out of order until no such pair exists, we
get an optimal order without increasing memory usage. ■
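Lemma 3 can also be sanity-checked by brute force on a small example (a toy model for illustration, not from the dissertation): enumerate every interleaving of two seqs that preserves each seq's internal order, score each by its peak memory, and compare against the merge in decreasing hi-lo order.

```python
# Toy check of Lemma 3: the base of an element is the lo retained by the
# most recently placed element of the other seq.
from itertools import combinations

def peak(order):
    base = {'S': 0, 'T': 0}     # base[x]: lo the *other* seq has retained
    hi_water = 0
    for who, hi, lo in order:
        other = 'T' if who == 'S' else 'S'
        hi_water = max(hi_water, hi + base[who])
        base[other] = lo        # later elements of the other seq sit on lo
    return hi_water

def interleavings(s, t):
    """All orders of s + t that keep each seq's internal order."""
    for pos in combinations(range(len(s) + len(t)), len(s)):
        it_s, it_t = iter(s), iter(t)
        yield [next(it_s) if k in pos else next(it_t)
               for k in range(len(s) + len(t))]

# Elements satisfy Lemma 1's invariants: hi non-increasing, lo non-decreasing.
S = [('S', 39, 9), ('S', 25, 16)]
T = [('T', 23, 3)]
greedy = sorted(S + T, key=lambda e: e[2] - e[1])  # decreasing hi - lo
assert peak(greedy) == min(peak(o) for o in interleavings(S, T))
```

On these (hypothetical) numbers both the greedy merge and the brute-force minimum give a peak of 39; the stable sort preserves each seq's internal order, as Lemma 3 requires.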
Theorem 4 Given an expression tree, the algorithm presented in Section 3.4.2
computes a traversal that uses the least memory.
Proof
We prove the correctness of the algorithm by describing a procedure that
transforms any given traversal to the traversal found by the algorithm without increase in
memory usage in any transformation step. Given a traversal P for an expression tree
T, we visit the nodes in T in a bottom-up manner and, for each non-leaf node v in
T, we perform the following steps:
1. Let T' be the subtree rooted at v and P' be the minimal substring of P that
contains all the nodes from T' - {v}. In the following steps, we will rearrange
the nodes in P' so that the nodes that form an indivisible unit in v.seq are
contiguous and the indivisible units are in the same order as they are in v.seq.
2. First, we sort the components of the indivisible units in v.seq so that they are in
the same order as in v.seq. The sorting process involves rearranging two kinds
of units. The first kind of units are the indivisible units in u.seq for each child
u of v. The second kind of units are the contiguous sequences of nodes in P'
which are from T - T'. For this sorting step, we temporarily treat each such
maximal contiguous sequence of nodes as a unit. For each unit E of the second
kind, we take E.hi = max_{w ∈ E} himem(w, P) and E.lo = lomem(w_n, P) where w_n
is the last node in E. The sorting process is as follows.
While there exist two adjacent units E' and E in P' such that E' is
before E and E'.hi - E'.lo < E.hi - E.lo,
(a) Swap E' and E. By Lemma 3, this does not increase the memory
usage.
(b) If two units of the second kind become adjacent to each other
as a result of the swapping, combine the two units into one and
recompute its new hi and lo.
When the above sorting process finishes, all units of the first kind, which are
components of the indivisible units in v.seq, are in the order of decreasing hi-lo
difference. Since, for each child u of v, indivisible units in u.seq have been in the
correct order before the sorting process, their relative order is not changed. The
order of the nodes from T - T' is preserved because the sorting process never
swaps two units of the second kind. Also, v and its ancestors do not appear
in P', and nodes in units of the first kind are not ancestors or descendants of
any nodes in units of the second kind. Therefore, the sorting process does not
violate parent-child dependences.
3. Now that the components of the indivisible units in v.seq are in the correct
order, we make the indivisible units contiguous using the following combining
process.
For each indivisible unit E in v.seq,
(a) In the traversal P, if there are nodes from T - T' in between the
nodes from E, move them either to the beginning or the end of
E as specified by Lemma 2.
(b) Make the contiguous sequence of nodes from E an indivisible
unit.
Upon completion, each indivisible unit in v.seq is contiguous in P and the order
in P of the indivisible units is the same as they are in v.seq. According to
Lemma 2, moving 'foreign' nodes out of an indivisible unit does not increase
the memory usage. Also, the order of the nodes from T - T' is preserved. Hence,
the combining process does not violate parent-child dependences.
We use induction to show that the above procedure correctly transforms any
given traversal P into an optimal traversal found by the algorithm. The induction
hypothesis H(u) for each node u is that:
• the nodes in each indivisible unit in u.seq appear contiguously in P and are in
the same order as they are in u.seq, and
• the order in P of the indivisible units in u.seq is the same as they are in u.seq.
Initially, H(u) is true for every leaf node u because there is only one traversal order
for a leaf node. As the induction step, assume H(u) is true for each child u of a
node v. The procedure rearranges the nodes in P' so that the nodes that form an
indivisible unit in v.seq are contiguous in P, the sets of nodes corresponding to the
indivisible units are in the same order in P as they are in v.seq, and the order among
the nodes that are not in the subtree rooted at v is preserved. Thus, when the
procedure finishes processing a node v, H(v) becomes true. By induction, H(T.root)
is true and a traversal found by the algorithm is obtained. Since any traversal P can
be transformed into a traversal found by the algorithm without increasing memory
usage in any transformation step, no traversal can use less memory and the algorithm
is correct. ■
3.5 Dynamic Memory Allocation
Under dynamic memory allocation, space for arrays can be allocated and deallocated
as needed. We allow an array to be allocated/deallocated multiple times by
putting the allocate/deallocate statements inside some loops. But each array must
be allocated/deallocated in its entirety. We do not consider sharing of space between
arrays. Given a loop fusion configuration, the positions of the allocate/deallocate
statements are determined as follows. Let v be an array, s_e be the statement that
evaluates v, s_u be the statement that uses v, and the t-loop be the innermost loop
that contains both s_e and s_u. The latest point an "allocate v" statement can be
placed is inside the t-loop and before s_e and any loops that surround s_e and are
inside the t-loop. The earliest point a "deallocate v" statement can be placed is inside
the t-loop and after s_u and any loops that surround s_u and are inside the t-loop. The
memory usage of a loop fusion configuration is no longer the sum of the array sizes,
but the maximum total size of the arrays that are allocated at any time.
Another difference between dynamic allocation and static allocation is that, with
dynamic allocation, the evaluation order of the nodes having equal fusion with their
parents can affect memory usage although the sizes of individual arrays remain
unchanged. Consider the fusion graph shown in Figure 3.12(a). Since f1 and f3 are
equally fused with f4, either of the two subtrees rooted at f1 and f3 can be evaluated
first. Figure 3.12(b) and (c) show the loop fusion configurations for the two evaluation
orders. Assume the same index ranges N_i = 500, N_j = 100, N_k = 40, and N_l = 15
as before. In (b), the maximum memory usage is 4141 elements when f1, f3, f4, and
f5 are allocated during the evaluation of f4 and f5. But in (c), when f1 is being
evaluated, A, f1, and f3 coexist in memory and their total size is 4600 elements.
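The accounting rule, maximum total size of simultaneously allocated arrays, can be sketched as a scan over a flattened allocate/free event trace (a simplification: the real configurations allocate and free inside loops repeatedly; the sizes and event orders below follow the Figure 3.12 example, with f2 and f4 as scalars):

```python
# Sketch: under dynamic allocation, memory usage is the maximum total size
# of simultaneously live arrays, not the sum of all array sizes.
def peak_memory(trace, sizes):
    """trace: sequence of ('alloc'|'free', name) events."""
    live = peak = 0
    for op, name in trace:
        live += sizes[name] if op == 'alloc' else -sizes[name]
        peak = max(peak, live)
    return peak

# Fused array sizes from the example: Ni=500, Nj=100, Nk=40, Nl=15.
sizes = {'A': 500, 'B': 15, 'C': 15, 'f1': 100, 'f2': 1,
         'f3': 4000, 'f4': 1, 'f5': 40}

def run(order):
    """Flattened event order: evaluate f1's and f3's subtrees in the given
    order, then f4/f5 (one alloc/free pair per array)."""
    ev = {'f1': [('alloc', 'f1'), ('alloc', 'A'), ('free', 'A')],
          'f3': [('alloc', 'f3'), ('alloc', 'C'), ('alloc', 'B'),
                 ('alloc', 'f2'), ('free', 'f2'), ('free', 'B'),
                 ('free', 'C')]}
    tail = [('alloc', 'f5'), ('alloc', 'f4'), ('free', 'f4'),
            ('free', 'f5'), ('free', 'f1'), ('free', 'f3')]
    return ev[order[0]] + ev[order[1]] + tail

print(peak_memory(run(['f1', 'f3']), sizes))  # 4141, as in Figure 3.12(b)
print(peak_memory(run(['f3', 'f1']), sizes))  # 4600, as in Figure 3.12(c)
```

Evaluating f3's subtree first leaves its 4000-element array live while A (500 elements) is allocated, which is exactly why the evaluation order matters under dynamic allocation.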
Therefore, in addition to finding the fusions between children and parents, we
need to determine the evaluation order of the nodes that minimizes memory usage.
A simpler problem of finding the memory-optimal evaluation order of the nodes in an
expression tree without any fusion has been addressed in Section 3.4. Here, we apply
its result to each set of nodes having equal fusion to obtain the optimal evaluation
order among them.
We use the same bottom-up, dynamic programming algorithm as in Figures 3.3 to
3.6 to traverse the expression tree and enumerate the legal fusions for the nodes. But
the procedures related to calculations of memory usage (namely, InitMemUsage,
[Figure 3.12 shows the fusion graph (a), in which f1 and f3 are equally fused with their parent f4, together with the loop fusion configurations for the two evaluation orders: (b) evaluates the subtree rooted at f1 first, and (c) evaluates the subtree rooted at f3 first.]

Figure 3.12: A fusion graph with equal fusions and its two evaluation orders.
InitMemUsage (s):
  s.seqset = ∅

AddMemUsage (v, f, size, s1, s):
  s.seqset = s1.seqset
  x = Set(f)
  foreach x' ∈ s.seqset s.t. x ⊆ x', in decreasing |x'|
    CollapseSeq (s.seqset, x', x)
  if x ∉ s.seqset then
    x.seq = ()
    s.seqset = s.seqset ∪ {x}
    base = 0
  else
    base = x.seq[|x.seq|].lo
  AppendSeq (x.seq, (v), size + base, size)

MergeMemUsage (s1, s2, s):
  s.seqset = s1.seqset
  foreach x' ∈ s2.seqset
    if ∃x ∈ s.seqset s.t. x = x' then
      MergeSeq (x.seq, x'.seq)
    else
      s.seqset = s.seqset ∪ {x'}

CollapseSeq (QQ, x', x):
  if ∃y ∈ QQ s.t. x ⊆ y ⊂ x' then
    y = NextSmaller (QQ, x')
    base = y.seq[|y.seq|].lo
  else
    y = x
    y.seq = ()
    base = 0
  for i = 1 to |x'.seq|
    AppendSeq (y.seq, x'.seq[i].nodelist, x'.seq[i].hi + base, x'.seq[i].lo + base)

Figure 3.13: Algorithm for dynamic memory allocation.
MergeSeq (Q1, Q2):
  Q = ()
  i = j = 1
  base1 = base2 = 0
  while i ≤ |Q1| or j ≤ |Q2|
    if j > |Q2| or (i ≤ |Q1| and Q1[i].hi - Q1[i].lo > Q2[j].hi - Q2[j].lo) then
      AppendSeq (Q, Q1[i].nodelist, Q1[i].hi + base1, Q1[i].lo + base1)
      base2 = Q1[i].lo
      i++
    else
      AppendSeq (Q, Q2[j].nodelist, Q2[j].hi + base2, Q2[j].lo + base2)
      base1 = Q2[j].lo
      j++
  end while
  return Q

AppendSeq (Q, nodelist, hi, lo):
  E = (nodelist, hi, lo)  // new element to append to Q
  i = |Q|
  while i ≥ 1 and (E.hi ≥ Q[i].hi or E.lo ≤ Q[i].lo)
    E = (Q[i].nodelist + E.nodelist, max(Q[i].hi, E.hi), E.lo)
    remove Q[i] from Q  // Q[i] is combined into E
    i--
  end while
  Q = Q + E  // |Q| is now i + 1

Inferior (s, s') = s.nesting ⊆ s'.nesting and InferiorSeqSet (s.seqset, s'.seqset)

InferiorSeqSet (QQ, QQ') =
  ∀y, High (QQ, y) ≥ High (QQ', y) and Low (QQ, y) ≥ Low (QQ', y)

High (QQ, y) = x.seq[1].hi + Low (QQ, x')
  where x = NextSmaller (QQ, y) and x' = NextSmaller (QQ, x)

Low (QQ, y) = Σ_{x ∈ QQ and x ⊆ y} x.seq[|x.seq|].lo

NextSmaller (QQ, y) = max{x ∈ QQ | x ⊂ y}

Figure 3.14: Algorithm for dynamic memory allocation (cont.)
AddMemUsage, MergeMemUsage, and Inferior) are replaced by those in
Figures 3.13 and 3.14.
To correctly calculate memory usage and to determine the optimal evaluation
order of nodes with equal fusion, instead of a mem field, each solution for a node
v now has a seqset field, which is a set of fusion indexsets Set(f) for the fusions f
in the subtree rooted at v. Each fusion indexset x in s.seqset is associated with an
x.seq, which is a sequence of indivisible units. Each indivisible unit is a (nodelist,
hi, lo) triple, where nodelist is an ordered list of nodes, hi is the maximum memory
usage during the evaluation of the nodes in nodelist, and lo is the memory usage after
those nodes are evaluated. The nodes in a nodelist have the special property that
there necessarily exists some globally optimal traversal of the entire tree wherein this
sequence appears undivided. Therefore, inserting any node in between the nodes of
an indivisible unit does not lower the total memory usage.
The seqsets and seqs are manipulated as follows. When two solutions from two
children nodes of the same parent are merged together, their seqsets are merged by
the MergeMemUsage procedure. A seq for a fusion indexset that appears in only
one seqset is simply copied to the result seqset. If a fusion indexset x appears in both
seqsets, the indivisible units in the two seqs for x are "merge-sorted" together in the
order of decreasing hi-lo difference (by the MergeSeq procedure) to form a combined
seq and their hi-lo values are adjusted. In Section 3.4, we have proved that arranging
the indivisible units in this order minimizes memory usage. Moreover, the indivisible
units in a seq must appear in the same order in some optimal traversal of the entire
expression tree.
v     v.fusion    x          x.seq
A     ⟨j⟩         {j}        (((A), 500, 500))
B     ⟨j,k⟩       {j,k}      (((B), 15, 15))
C     ⟨k⟩         {k}        (((C), 15, 15))
f1    ⟨⟩          ∅          (((A, f1), 600, 100))
f2    ⟨k,j,l⟩     {j,k,l}    (((f2), 1, 1))
                  {j,k}      (((B), 15, 15))
                  {k}        (((C), 15, 15))
f3    ⟨⟩          ∅          (((C, B, f2, f3), 4031, 4000))
f4    ⟨j,k⟩       ∅          (((A, f1, C, B, f2, f3), 4131, 4100))
                  {j,k}      (((f4), 1, 1))
f5    ⟨⟩          ∅          (((A, f1, C, B, f2, f3, f4, f5), 4141, 40))

Table 3.3: The seqsets for the fusion graph in Figure 3.12(a).
When a solution is extended to include a new node v having fusion f with its
parent, the seqs in v.seqset for fusion indexsets larger than Set(f) are collapsed into
the seq for Set(f) (by the CollapseSeq procedure). In collapsing a seq Q into another
seq Q', all the indivisible units in Q are appended to Q' after their hi-lo values are
adjusted. When all collapsings are done, all fusion indexsets in v.seqset are subsets
of Set(f). Then, a new indivisible unit for v is appended to the seq for Set(f).
Whenever an indivisible unit E is appended to a seq Q by the AppendSeq
procedure, it is combined with indivisible units at the end of Q whose hi is not higher
than E.hi or whose lo is not lower than E.lo. The combined indivisible unit has the
concatenated nodelist and the highest hi but the original E.lo. An indivisible unit
initially contains a single node. But as the algorithm traverses up the tree, indivisible
units may be combined to form ones that contain more nodes.
As the procedures for enumerating legal fusions are the same for both static and
dynamic memory allocation, we will use a particular fusion graph, the one in
Figure 3.12(a), to illustrate how memory usage is maintained by the algorithm under
dynamic allocation. For node A, the fusion is ⟨j⟩ and the fused array size is N_i = 500.
So, A.seqset has a single fusion indexset {j} whose associated seq is (((A), 500, 500)).
The seqsets for B and C are similarly obtained. Since f1 is not fused with f4, the
AddMemUsage procedure collapses the only seq in A.seqset from fusion indexset
{j} to ∅. When a new indivisible unit ((f1), 600, 100) for f1 is appended to the seq for
∅, it is combined with the existing indivisible unit into ((A, f1), 600, 100). Now, A and
f1 become indivisible. For node f2, the seqsets for B and C (which share no common
fusion indexset) are first merged together by MergeMemUsage and then a new seq
for {j,k,l} = Set(⟨k,j,l⟩) is created for f2. Thus, f2.seqset has 3 seqs. To form the
seqset for f3, which is not fused with f4, the AddMemUsage procedure first
collapses the 3 seqs in f2.seqset into (((C, B, f2), 31, 31)), and then appends to it a new
indivisible unit ((f3), 4031, 4000) to form the single seq (((C, B, f2, f3), 4031, 4000))
in f3.seqset. The seqsets for all the nodes are shown in Table 3.3. The single seq
in the root node's seqset contains the optimal evaluation order of the nodes and the
hi value of the first indivisible unit in that seq is the maximum memory usage (see
Figure 3.12(b)).
3.6 Common Sub-Expressions
A formula sequence for a multi-dimensional summation with common sub-expressions
may need a directed acyclic graph (DAG) representation instead of an expression tree.
The construction of a potential fusion graph G for a DAG, which differs from
that for a tree described in Section 3.1, is as follows.
1. For each node u in the DAG, create in G a set of vertices, one for each dimension
of array u. If u is a summation node, add a vertex for the summation index.
2. In each formula in the given formula sequence, if dimension d or the summation
index of the result array u shares the same index variable with dimension d'
of an operand array u', then connect in G the vertex for dimension d or the
summation index of node u and the vertex for dimension d' of node u' with a
potential fusion edge.
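The two construction steps above can be sketched as follows (the formula encoding and the function name are hypothetical, not from the dissertation, and each formula is assumed to sum over at most one index; a vertex is a (node, dimension) pair, with dimension 0 standing for the summation index):

```python
# Sketch of potential-fusion-graph construction for a DAG.
def potential_fusion_graph(formulas):
    """formulas: list of (result, result_indices, operands), with operands
    a list of (name, indices); shared index variables link dimensions."""
    vertices, edges = set(), set()
    for result, r_idx, operands in formulas:
        for d in range(1, len(r_idx) + 1):          # step 1: result's dims
            vertices.add((result, d))
        sum_vars = ({v for _, o_idx in operands for v in o_idx}
                    - set(r_idx))                   # summation index, if any
        if sum_vars:
            vertices.add((result, 0))
        for name, o_idx in operands:                # step 2: fusion edges
            for d2, var in enumerate(o_idx, start=1):
                vertices.add((name, d2))
                if var in r_idx:
                    edges.add(((result, r_idx.index(var) + 1), (name, d2)))
                elif var in sum_vars:
                    edges.add(((result, 0), (name, d2)))
    return vertices, edges

# f1[j] = sum_i A[i,j] * B[i,j]  (a hypothetical one-formula sequence)
V, E = potential_fusion_graph([('f1', ('j',), [('A', ('i', 'j')),
                                               ('B', ('i', 'j'))])])
assert (('f1', 0), ('A', 1)) in E and (('f1', 1), ('A', 2)) in E
```

In the tiny example, dimension 1 of A (index i) connects to f1's summation vertex, and dimension 2 of A (index j) connects to dimension 1 of f1, mirroring step 2.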
As an example, consider the multi-dimensional summation in Figure 3.15(a), in
which arrays A and B appear twice. A formula sequence for computing the
summation and its DAG representation are shown in Figure 3.15(b) and (c), respectively.
Figure 3.15(d) shows a potential fusion graph constructed as explained above for the
DAG.
Notice that, unlike a fusion graph for a tree, the vertices in a (potential) fusion
graph for a DAG are labeled not by index variable names but by dimension numbers.
This is because multiple references to an array may have different index variables,
which makes index renaming possible and precludes a unique mapping between array
dimensions and index variables.
A fusion graph is a potential fusion graph with a subset of the potential fusion
edges turned into fusion edges. However, some fusion graphs are illegal, i.e., they do
not correspond to loop fusion configurations that correctly compute the final result.
A fusion graph G for a DAG is legal if it satisfies the following requirements.
[Figure 3.15: (a) a multi-dimensional summation in which arrays A and B appear twice; (b) a formula sequence computing it; (c) a DAG representation for (b); (d) a potential fusion graph for (c).]

Figure 3.15: An example multi-dimensional summation with common sub-expressions and representations of a computation.
[Figure 3.16 shows five fusion graphs, (a) through (e), each of which violates one of the legality requirements discussed below.]

Figure 3.16: Examples of illegal fusion graphs for a DAG.
1. The scopes of any two fusion chains have to be disjoint or a subset/superset
of each other. This was the only requirement on a legal fusion graph for a
tree (see Section 3.1). For a DAG, it also means that the scopes of fusion
chains along different paths from a multi-parent node v to DOM(v) must not
partially overlap. Figure 3.16(a) shows an example fusion graph that violates
this requirement. It is illegal because the two loops for the two fusion chains
are nested but neither of them can be the outer loop.
2. A multi-parent node v must have the same fusion with all its parents. In other
words, each dimension of v must be fused with either all the parents of v or
none of them. This requirement ensures v is computed only once and has only
one size. For example, in Figure 3.16(b), the second dimension of f1 is fused
with one of its parents but not the other. So, the fusion graph is illegal.
3. If a multi-parent node v and an ancestor v' of v are in the scope of the same
fusion chain c, then any node u that is an ancestor of v and a descendant of v'
must also be in the scope of c. If u were not in the scope of c, u would be outside the
loop that corresponds to c and contains both v and v'. But the evaluation of u
can neither be before nor after the loop because u depends on v and v' depends
on u. This requirement holds even if no potential fusion edge(s) exist to extend
the scope of c to include u. The example fusion graph in Figure 3.16(c) does
not satisfy this requirement since several nodes are in the scope of a fusion
chain, but an intermediate node between them is not.
[Figure 3.17 shows two legal fusion graphs for the DAG of Figure 3.15(c) and the loop fusion configurations that correspond to them.]

Figure 3.17: Examples of legal fusion graphs and corresponding loop fusion configurations for a DAG.
4. A fusion chain must not connect two different vertices of the same node. For
instance, the fusion chain in Figure 3.16(d) connects both vertices of f1. Hence,
the fusion graph is illegal.
5. If a node u and its parent u' are in the scope of the same fusion chain c, then
there must be either a fusion edge in c between u and u', or a potential fusion
edge that connects the two vertices in u and u' that c connects. Thus, the
fusion graph in Figure 3.16(e) is illegal since f4 and f5 are in the scope of a
fusion chain, but there is no fusion edge or potential fusion edge that connects
the summation vertex of f4 and the only vertex of f5.
Figure 3.17 shows two legal fusion graphs and their corresponding loop fusion
configurations.
A is dense. Sparsity entry in B and f1: (i, j, 0.1)

(a) (b)

for i
  for j [ A[j]=genA(i,j)
  for j∈nonzeros1(i)
    [ B=genB(i,j)
      f1[i,j]=A[j]×B

(c)

Figure 3.18: An example of legal loop fusions for sparse arrays.
3.7 Sparse Arrays
As discussed in Section 2.7, sparsity in an input or intermediate array can be
represented as a list of sparsity entries. Each sparsity entry consists of the two
dimensions of the array involving the sparsity and a sparsity factor, which equals
the fraction of non-zero array elements. Since we assumed earlier, when counting
arithmetic operations for sparse arrays, that only non-zero elements in result arrays
are formed, the loops corresponding to sparse dimensions of a result array should
iterate over non-zero elements only. For a pair of loops to be fusible among a set of
nodes, the two loops must have compatible loop ranges, i.e., they have to iterate over
the same number of non-zero elements in each of those nodes. This leads to a new
requirement on a legal fusion graph: when a pair of loops are fused among a set of
nodes, the sparsity factor between the two fused dimensions has to be the same for
all those nodes.
For instance, in the formula f1[i,j] = A[i,j] × B[i,j], suppose that A is dense and
B, and hence f1, have a single sparsity entry of (i, j, 0.1). We can fuse both the i- and
j-loops between B and f1 and one of the two loops between A and f1. Figure 3.18
shows how the fusion graph and the corresponding loop fusion configuration would
(a) (b)
Figure 3.19: Illegal fusion graphs due to representations of sparse arrays.
look like, in which nonzeros1(i) is the set of j values such that B[i,j] is non-zero.
However, fusing both loops between A and f1 is illegal because they have different
sparsity factors (and we assume the entire array A needs to be formed for other
purposes).
Internal representations of sparse arrays could impose another constraint on legal
fusion graphs. If the representation allows sparse arrays to be accessed in any order
of the array dimensions efficiently, loops around the production or the consumption
of sparse arrays can be permuted freely to facilitate loop fusions. Otherwise, only a
single or a limited set of loop permutations around sparse arrays would be permitted,
thus restricting how loops can be fused. For example, if a two-dimensional sparse
array C must be accessed in row-major order, then the fusion graph in Figure 3.19(a)
would become illegal because it makes j, the second dimension of C, the outer loop.
Also, if a sparse array f3 can be accessed in only one order, the fusion graph in
Figure 3.19(b) would again be illegal since f3 is produced row-wise but consumed
column-wise.
Sparsity entries in Y: (i, j, s1), (j, k, s2), (k, l, s3)

(a)

for i
  for j∈nonzeros1(i)
    for k∈nonzeros2(j)
      for l∈nonzeros3(k)
        [ Y[k,l]=genY(i,j,k,l)
    for k∈nonzeros2(j)
      for l∈nonzeros3(k)
        [ f6[i,j,k,l]+=Y[k,l]

(b) (c)

Fused array size of Y = N_k × N_l × s2 × s3

(d)
Figure 3.20: Fused size of sparse arrays.
Fusing a loop, say a t-loop, between a node v for a sparse array and its parent
eliminates the t-dimension in the sparse array v, but does not always reduce the
array size (i.e., the number of non-zero elements²) by a factor of N_t. As an example,
consider the formula f6[i,j,k,l] = Y[i,j,k,l], in which array Y has sparsity entries
(i, j, s1), (j, k, s2), and (k, l, s3). If we choose to fuse only the i- and j-loops between
Y and f6, the sparsity graph and the loop fusion configuration would be as shown in
Figure 3.20. Here, the fused size (i.e., the size after loop fusion) of Y is the number of
non-zero elements in Y when i and j are fixed, i.e., the number of non-zero elements
in the slice Y[i, j, *, *]. In general, the fused size of a sparse array v equals the number
of non-zero elements in v when all the fused dimensions have fixed values. This fused
size can be computed as the product of the ranges of all unfused dimensions of v
times the product of the sparsity factors in the sparsity entries of v that involve at
least one unfused dimension. Formally, it can be expressed as:

(∏_{i ∈ v.dimens − Set(f)} N_i) × (∏_{e ∈ L} e.sparsityfactor)
²Depending on the internal representation of sparse arrays, the actual memory usage could be higher.
[Figure 3.21(a) shows an FFT node with input X[K,i] and the exponential array exp[i,j]; (b) shows the corresponding fusion graph when the loops in K are fused.]

for K
  for i [ X[i] = ...
  fr[1:Nj] = fft(X[1:Ni])
  for j [ ... = fr[j] ...

(c)
Figure 3.21: Loop fusions for an FFT node.
where f is the fusion between v and its parent, and

L = {e ∈ v.sparsityentries | e.dim1 ∉ Set(f) or e.dim2 ∉ Set(f)}

In our example, the fused size of Y is N_k × N_l × s2 × s3.
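The formula can be sketched directly (the helper name and the numeric sparsity factors s1 = 0.1, s2 = 0.2, s3 = 0.3 are hypothetical; the dimensions and sparsity entries are those of Y in the example):

```python
# Sketch of the fused-size formula for a sparse array.
def fused_size(dims, ranges, entries, fused):
    """dims: dimension names of v; ranges: dim -> N; entries: list of
    (dim1, dim2, factor) sparsity entries; fused: Set(f)."""
    size = 1.0
    for d in dims:
        if d not in fused:
            size *= ranges[d]          # ranges of all unfused dimensions
    for d1, d2, s in entries:
        if d1 not in fused or d2 not in fused:
            size *= s                  # entries touching an unfused dimension
    return size

ranges = {'i': 500, 'j': 100, 'k': 40, 'l': 15}
entries = [('i', 'j', 0.1), ('j', 'k', 0.2), ('k', 'l', 0.3)]
# Fusing the i- and j-loops leaves N_k * N_l * s2 * s3 elements:
assert abs(fused_size('ijkl', ranges, entries, {'i', 'j'})
           - 40 * 15 * 0.2 * 0.3) < 1e-9
```

Note that the entry (i, j, s1) drops out because both of its dimensions are fused, while (j, k, s2) and (k, l, s3) each touch an unfused dimension and so stay in the product.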
3.8 Fast Fourier Transform
The general form of an FFT formula is fr [K , j ] = z] xexp[z, j] where AT is a
set of indices, j ^ K and exp[z, j] = Section 2.8). This formula can be
represented by an FFT node as shown in Figure 3.21(a). We assume that the FFTs are
formed by calling library routines, which com pute the exponential functions needed.
Thus, we do not consider the memory usage by exponential functions. Furthermore,
since the FFT library routines transform one or more entire rows of X to an equal
number of rows of fr at a time, the i- and j-loop s at an FFT node cannot be fused
with its children or its parent; only the loops for the indices in K can be fused.
Figure 3.21(b) and (c) show the fusion graph and the loop fusion configuration when
the loops in K are fused.
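The fused loop structure of Figure 3.21(c) can be sketched with a library FFT routine (here NumPy's, as a stand-in; the sizes and the consumer are hypothetical):

```python
# Sketch of a fused K loop around an FFT node: the producer of X, the
# whole-row library FFT, and the consumer of fr all sit inside the K loop,
# so only one row of X and fr is live at a time. The i- and j-loops are
# internal to the library call and cannot be fused.
import numpy as np

N_K, N_i = 8, 16                      # N_j equals N_i for the FFT routine
rng = np.random.default_rng(0)
acc = np.zeros(N_i, dtype=complex)
for K in range(N_K):                  # fused K loop
    X_row = rng.standard_normal(N_i)  # produce X[K, 1:N_i]
    fr_row = np.fft.fft(X_row)        # whole-row library FFT
    acc += fr_row                     # consume fr[K, 1:N_j] immediately
```

Fusing the K loop reduces the storage for X and fr from N_K × N_i elements each to a single row of N_i elements each, which is exactly the benefit the fusion graph in (b) captures.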
3.9 Further Reduction in M emory Usage
So far, we have been restricting our attention to reducing memory usage without
any increase in the number of arithmetic operations. However, with this restriction,
the optimal loop fusion configuration for some formula sequences may still require
more than the available memory. The existence of common sub-expressions, sparse
arrays, and fast Fourier transforms in a formula sequence may prohibit the fusion of
some loops, thereby imposing high lower bounds on the sizes of some intermediate
arrays. As a result, it may be necessary to relax the restriction on operation count,
for the implementation of some formula sequences to be feasible in terms of memory
usage.
As we have explained in Section 1.3, optimizing for operation count, memory us-
age, and communication cost together in an integrated manner is intractable. For the
same reasons, we only consider the operation-optimal formula sequence when trading
operation count for memory usage, although another formula sequence may lead to
a better solution. In other words, when it is necessary to relax the restriction on
operation count to further reduce memory usage, our approach no longer guarantees
any optimality. Nevertheless, since we start with an operation-optimal formula se-
quence, we expect the operation-relaxed solutions obtained to have good performance
in terms of memory usage and operation count.
We propose the following transformations to the formula sequence in question and
the corresponding potential fusion graph for further memory reduction at a cost of
increased arithmetic operations. The resulting formula sequence and the potential
fusion graph are then fed into one of the memory usage minimization algorithms
described in Sections 3.3 and 3.5 to determine the optimal loop fusion configuration.
• Creating additional loops around some assignment statements may enable more
loop fusions and thus reduce array sizes. For example, in Figure 3.2(b), if
we create an i-loop around f2 and f3 and fuse it with the i-loop around A
and f1, we can eliminate the i-dimension of f1 and make it a scalar. This
increases the operation counts for f2 and f3 by a factor of N_i because they are
recomputed N_i times. In general, we add additional vertices to the potential fusion
graph and connect each new vertex of a node v and the corresponding vertex
at the parent of v with an additional potential fusion edge. If the additional
potential fusion edge is converted into a fusion edge in a fusion graph, the
operation count for node v is multiplied by Ni, where i is the loop index for the
fusion edge. Note that if we place no limit on the operation count, the memory
usage minimization problem would have a trivial solution, which is to put all
the assignment statements inside a perfect loop nest. Doing so would reduce
all intermediate arrays to scalars but the operation count would return to its
original unoptimized value.
• For a formula sequence with common sub-expressions, the fusion-preventing
constraints in Section 3.6 can be overcome by recomputing the common sub-
expressions, once for each use of a common sub-expression. We transform the
DAG and its potential fusion graph by splitting an n-parent node v (for n > 1)
into multiple nodes v_1, ..., v_n and making v_k the child of the k-th parent of v.
The subtrees or sub-DAGs rooted at v are also split into n copies. In a fusion
graph, the split nodes can be united into one if the fusions in their subtrees
or sub-DAGs are the same and if doing so does not violate the constraints in
Section 3.6. The operation count for node v or a descendant v' of v is multiplied
by the number of split nodes for v or v' remaining after the split nodes are united,
if possible.
• As mentioned in Section 3.8, the loops for the two indices in the exp function
of an FFT node v cannot be fused with the corresponding loops in the parent
or the children of v. We can convert the FFT node back into a multiplication
node and a summation node to allow more possible loop fusions. This may
result in an increase in the operation count, which can be calculated according
to Section 2.2.
The memory usage minimization algorithms can be modified to maintain the
additional arithmetic operations for each solution. A solution s is inferior to another
solution s' if, in addition to the existing criteria, the additional operations for s are greater
than or equal to those for s'. Any solution with a cumulative memory usage higher than
a user-specified limit can be pruned. At the end, the algorithms return an unpruned
solution for the root node that has the fewest additional arithmetic operations.
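The extended inferiority test and the memory-limit pruning can be sketched as follows (the solution representation is hypothetical; the sketch uses strict domination so that identical solutions do not prune each other):

```python
def inferior(s, t):
    """s is inferior to t if s is no better in memory usage and no better
    in additional operations (extending the existing pruning criteria)."""
    return s['memory'] >= t['memory'] and s['extra_ops'] >= t['extra_ops']

def prune(solutions, memory_limit):
    # Drop solutions over the user-specified memory limit, then keep only
    # the solutions that are not strictly dominated by a surviving one.
    kept = [s for s in solutions if s['memory'] <= memory_limit]
    return [s for s in kept
            if not any(inferior(s, t) and not inferior(t, s) for t in kept)]

sols = [{'memory': 100, 'extra_ops': 0},
        {'memory': 50, 'extra_ops': 10},
        {'memory': 120, 'extra_ops': 0},   # over the limit below
        {'memory': 60, 'extra_ops': 20}]   # dominated by (50, 10)
print(prune(sols, 110))
```

The surviving solutions form a Pareto frontier over memory usage and additional operations, matching the text's rule that a solution is discarded only when another is at least as good on every criterion.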
3.10 An Example
We illustrate the practical application of the memory usage minimization algo-
rithm on the example multi-dimensional summation described in Section 2.9. The
optimal formula sequence for the summation has a cost of 1.89 × 10^15 operations and
is reproduced below. The DAG representation of the formula sequence is shown in
Figure 3.22. In this example, array Y has a sparsity of 0.1 and the ranges of the
indices are as given in Section 2.9. Without any loop fusion, the total size of the
arrays is 1.13 × 10^13 elements.
f1[r,RL,RL1,t] = Y[r,RL] * G[RL1,RL,t] cost=1e+12 <r,RL,0.1>
[Figure 3.22 shows the DAG: input arrays Y[r,RL], G[RL1,RL,t], Y[r,RL2], and the exponential functions exp[k,r], exp[k,rl], exp[G,r], exp[G1,rl] at the leaves; multiplication and summation nodes f1, f2, f5, f6, f7, f10, and f11; and the FFT nodes f13 (over r) and f15 (over rl) at the top.]
Figure 3.22: The DAG representation of an example formula sequence.
f2[r,RL1,t] = sum RL f1[r,RL,RL1,t] cost=9.9e+11 dense
f5[r,RL2,rl,t] = Y[r,RL2] * f2[rl,RL2,t] cost=1e+14 <r,RL2,0.1>
f6[r,rl,t] = sum RL2 f5[r,RL2,rl,t] cost=9.9e+13 dense
f7[k,r,rl] = exp[k,r] * exp[k,rl] cost=0 dense
f10[r,rl,t] = f6[r,rl,t] * f6[rl,r,t] cost=1e+12 dense
f11[k,r,rl,t] = f7[k,r,rl] * f10[r,rl,t] cost=1e+13 dense
f13[k,rl,t,G] = fft r f11[k,r,rl,t] * exp[G,r] cost=1.660964e+15 dense
f15[k,t,G,G1] = fft rl f13[k,rl,t,G] * exp[G1,rl] cost=1.660964e+13 dense
Notice that the common sub-expressions Y, exp, and f6 appear at the right hand
side of more than one formula. Also, f13 and f15 are FFT formulae. As explained
in Sections 3.6, 3.7, and 3.8, if each array is to be computed only once, the presence
of these common sub-expressions and FFTs would prevent the fusion of some loops,
such as the r and rl loops between f6 and f10. Under the operation-count restriction,
the optimal loop fusion configuration obtained by the memory usage minimization
algorithm for static memory allocation requires memory storage for 1.10 × 10^10 array
elements, which is 1000 times better than without any loop fusion. But this translates
to about 110 gigabytes and probably still exceeds the amount of memory available in
any computer today. Thus, relaxation of the operation-count restriction is necessary
to further reduce the memory usage to reasonable values.
We perform the following simple transformations to the DAG and the correspond-
ing potential fusion graph (see Section 3.9).
• Two additional vertices are added: one for a k-loop around f10 and the other
for a t-loop around f7. These additional vertices are then connected to the
corresponding vertices in f11 with additional potential fusion edges to allow
more loop fusion opportunities between f11 and its two children.
• The common sub-expressions Y, exp, and f6 are split into multiple nodes. Two
copies of the sub-DAG rooted at f5 are also made. This overcomes the require-
ments on legal fusion graphs for DAGs discussed in Section 3.6.
The memory usage minimization algorithm for static memory allocation is then
applied on the transformed potential fusion graph. The loop fusion configuration,
the fusion graph, and the memory usage and operation count statistics of the optimal
solution found are shown in Figure 3.23. For clarity, the input arrays are not included
in the fusion graph. The memory usage of the optimal solution after relaxing the
operation-count restriction is significantly reduced, by a factor of about 100, to 1.12 ×
10^8 array elements. The operation count is increased by only around 10% to 2.10 ×
10^15. Compared with the best hand-optimized loop fusion configuration, which also
has some manually-applied transformations to reduce memory usage to 1.12 × 10^8
array elements and has 5.08 × 10^15 operations, the optimal loop fusion configuration
obtained by the algorithm shows a factor of 2.5 improvement in operation count while
using the same amount of memory.
[Figure 3.23 consists of (a) the loop fusion configuration: a loop nest over r, rl, and t that generates Y, accumulates f2, forms f5, f5', f6, f6', f10, f7, and f11 inside the fused loops, and calls the FFT routines to produce f13 and f15; (b) the fusion graph, with the input arrays omitted; and (c) a table of per-array memory usage and operation counts, totaling 1.12 × 10^8 array elements and 2.10 × 10^15 operations.]
Figure 3.23: Optimal loop fusions for the example formula sequence.
CHAPTER 4
COMMUNICATION MINIMIZATION
Given a sequence of formulae, we now address the problem of finding the optimal
partitioning of arrays and operations among the processors in order to minimize inter-
processor communication and computational costs in implementing the computation
on a message-passing parallel computer. Section 4.1 describes a multi-dimensional
processor model and characterizes the communication and computational costs. Sec-
tion 4.2 shows how the multi-dimensional processor model can be applied to analyze
the communication and computational costs of matrix multiplication on parallel com-
puters. Section 4.3 presents a dynamic programming algorithm for the communication
minimization problem. An example illustrating the use of the implemented algorithm
is provided in Section 4.4. The modifications to the algorithm for handling common
sub-expressions, sparse arrays, and FFT are discussed in Sections 4.5, 4.6 and 4.7,
respectively. Section 4.8 integrates the problems of communication minimization and
memory usage minimization and proposes two approaches for determining the distri-
bution of data arrays and computations among processors and the loop fusions that
minimize inter-processor communication for load-balanced parallel execution, while
not exceeding the memory limit.
[Figure: an expression tree over the input arrays A[i,j,t] and B[j,k,t], with summation nodes f1 and f2, the multiplication node f3 = f1 × f2, and a summation node at the root.]
Figure 4.1: An example expression tree.
4.1 Preliminaries
We use a logical view of the processors as a multi-dimensional grid, where each
array can be distributed or replicated along one or more of the processor dimensions.
Let p_d be the number of processors on the d-th dimension of an n-dimensional pro-
cessor array, so that the total number of processors is p_1 × p_2 × ... × p_n. We use
an n-tuple to denote the partitioning or distribution of the elements of a data array
on an n-dimensional processor array. The d-th position in an n-tuple α, denoted
α[d], corresponds to the d-th processor dimension. Each position may be one of the
following: an index variable distributed along that processor dimension, a '*' denot-
ing replication of data along that processor dimension, or a '1' denoting that only
the first processor along that processor dimension is assigned any data. If an index
variable appears as an array subscript but not in the n-tuple, then the corresponding
dimension of the array is not distributed. Conversely, if an index variable appears in
the n-tuple but not in the array, then the data is replicated along the corresponding
processor dimension, which is the same as replacing that index variable with a '*'.
As an example, consider the expression tree shown in Figure 2.1 and reproduced
in Figure 4.1. Suppose 64 processors form a 2 × 4 × 8 array. For the 3-dimensional
array B[j,k,t], the 3-tuple (k,*,1) specifies that the second dimension of B is dis-
tributed along the first processor dimension, the first and third dimensions of B
are not distributed, and that data are replicated along the second processor dimen-
sion and are assigned only to processors whose third processor dimension equals 1.
Thus, a processor whose id is P_{z1,z2,z3} will be assigned a portion of B specified by
B[1:N_j, myrange(z_1, N_k, p_1), 1:N_t] if z_3 = 1 and no part of B otherwise, where
myrange(z, N, p) is the range (z − 1) × N/p + 1 to z × N/p.
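The block ownership described above can be sketched directly (the helper name local_part_of_B is hypothetical, introduced only for this illustration):

```python
# Sketch of the myrange block distribution (1-based inclusive ranges).
def myrange(z, N, p):
    """Block of an N-element dimension owned by the z-th of p processors."""
    lo = (z - 1) * N // p + 1
    hi = z * N // p
    return lo, hi

# Portion of B[1:Nj, 1:Nk, 1:Nt] with distribution (k, *, 1) held by
# processor P_{z1,z2,z3}: all of j and t, a block of k, and data only
# on processors whose third grid coordinate is 1.
def local_part_of_B(z1, z2, z3, Nj, Nk, Nt, p1):
    if z3 != 1:
        return None                       # no part of B on this processor
    return ((1, Nj), myrange(z1, Nk, p1), (1, Nt))

print(local_part_of_B(2, 3, 1, 100, 40, 15, 2))  # ((1, 100), (21, 40), (1, 15))
```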
We assume the SPMD programming model and do not consider distributing the
computation of different formulae on different processors. Since a formula sequence
can be represented by an expression tree and each node in the expression tree corre-
sponds to an array, we use the terms 'node' and 'array' interchangeably and sometimes
refer to the array corresponding to a node v as array v.
A child array is redistributed before the evaluation of its parent if their distribu-
tions do not match. For instance, suppose the arrays f1[j,t] and f2[j,t] in Figure 4.1
have distributions (1,t,j) and (j,*,1) respectively. If we want f3 to have distribution
(j,t,1) when evaluating f3[j,t] = f1[j,t] × f2[j,t], f1 would have to be redistributed
from (1,t,j) to (j,t,1) because the two distributions do not match. But for f2 to go
from (j,*,1) to (j,t,1), each processor just needs to give up part of the t-dimension
of the array and no inter-processor data movement is required.
The number of processors or processor groups holding distinct parts of an array
v with distribution α is given by:

$$DistFactor(v, \alpha) = \prod_{d \,\mid\, \alpha[d] \,\in\, v.dimens} p_d$$

For example, if the array B[j,k,t] has distribution (*,k,1), then there are only p_2 = 4
processor groups having distinct parts of B.
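A minimal sketch of DistFactor, assuming a distribution tuple whose entries are index names, '*' (replicate), or '1' (first processor only):

```python
# Sketch: the number of processor groups holding distinct parts of array v
# under distribution alpha, on a grid with p[d] processors on dimension d.
def dist_factor(dimens, alpha, p):
    """dimens: set of v's array indices; alpha: distribution n-tuple whose
    entries are index names, '*', or '1'; p: processor grid shape."""
    factor = 1
    for d, entry in enumerate(alpha):
        if entry in dimens:        # only distributed dimensions contribute
            factor *= p[d]
    return factor

# B[j,k,t] on a 2 x 4 x 8 processor array, distribution (*, k, 1):
print(dist_factor({'j', 'k', 't'}, ('*', 'k', '1'), (2, 4, 8)))  # 4
```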
Let Mcost(localsize, α, β) be the communication cost in moving the elements of an
array, with localsize elements distributed on each processor, from an initial distribu-
tion α to a final distribution β. It depends on several factors such as the underlying
processor topology, the amount of node and link contention, and the message routing
mechanism employed. We empirically measure Mcost for each possible non-matching
pair of α and β and for several different localsizes on the target parallel computer.
Let MoveCost(v, α, β) denote the communication cost in redistributing the elements
of array v from an initial distribution α to a final distribution β. It can be expressed
as:

$$MoveCost(v, \alpha, \beta) = Mcost(DistSize(v, \alpha), \alpha, \beta)$$

where

$$DistSize(v, \alpha) = \frac{\prod_{i \in v.dimens} N_i}{DistFactor(v, \alpha)}$$

is the number of elements of array v distributed on each processor.
Let CalcCost(v, γ) be the computational cost in calculating an array v with γ as
the distribution of v. For multiplication, and for summation where the summation
index is not distributed, the computational cost for v can be quantified as the total
number of operations for v divided by the number of processors working on distinct
[Figure: the expression tree with leaves A[i,k] and B[k,j], the multiplication node f1, and the summation node f2 = Σ_k f1 at the root.]
Figure 4.2: Expression tree representation of matrix multiplication.
parts of v. In our example, if the array f3[j,t] has distribution (j,t,1), its com-
putational cost would be N_j × N_t/(p_1 × p_2) operations on each participating processor.
Formally,

$$CalcCost(v, \gamma) = \frac{\prod_{i \in v.indices} N_i}{\prod_{d \,\mid\, \gamma[d] \,\in\, v.indices} p_d}$$

where v.indices = v.dimens ∪ {v.sumindex} is the set of loop indices around the com-
putation of v.
For the case of summation where the summation index i = v.sumindex is dis-
tributed, partial sums of v are first formed on each processor and then either consol-
idated on one processor along the i dimension or replicated on all processors along
the same processor dimension. We denote by CalcCost1 and MoveCost1 the compu-
tational and communication costs for forming the sum without replication, and by
CalcCost2 and MoveCost2 those with replication.
4.2 Application to Matrix Multiplication
Matrix multiplication is a simple application in which the above multi-dimensional
processor model can be applied to analyze the communication and computational cost
on parallel computers. The standard form $C[i,j] = \sum_k (A[i,k] \times B[k,j])$ of matrix
multiplication can be rewritten as the following formula sequence:

f1[i,j,k] = A[i,k] × B[k,j]
C[i,j] = f2[i,j] = Σ_k f1[i,j,k]

and represented as the expression tree shown in Figure 4.2.
Several existing matrix multiplication algorithms for parallel computers fit well
into this model. One of them [31] is a simple parallel implementation of the serial block
matrix multiplication algorithm. In this simple parallel algorithm, the processors form
a 2-dimensional array. Initially, arrays A and B are fully block-distributed along both
processor dimensions. The initial distribution can be defined by the 2-tuples (i,k) for
array A and (k,j) for array B. In order for each processor P_{i,j} to acquire all the data
it needs to compute the sub-block C_{i,j} of the result matrix, sub-blocks of A and B
are then broadcast along the second and the first processor dimensions respectively.
The 2-tuple for the data distribution after broadcast is (i,j), which is equivalent
to (i,*) for A and (*,j) for B. The costs of broadcasting A and B are denoted as
MoveCost(A, (i,k), (i,j)) and MoveCost(B, (k,j), (i,j)) respectively.
As the distribution is now identical for both A and B, multiplication can take
place. The distribution of the multiplication operation is again (i,j) (note that the
k dimension is not distributed) and the cost of this operation is CalcCost(f1, (i,j)).
The intermediate result f1[i,j,k] now resides on P_{i,j} and hence has the same distri-
bution tuple (i,j). The final step is to add up the products on each processor. The
distributions of the summation operation and the result array C are both (i,j). The
cost of the summation is CalcCost1(f2, (i,j)). No further data movement is required.
Hence, the total execution time of this algorithm is represented as:

MoveCost(A, (i,k), (i,j)) + MoveCost(B, (k,j), (i,j)) + CalcCost(f1, (i,j)) + CalcCost1(f2, (i,j))
Another matrix multiplication algorithm that fits into our model is known as the
DNS algorithm [31], which is based on a 3-dimensional processor view. The source
arrays A and B are initially distributed the same way on the bottom processor plane
where the third processor dimension is 1. Thus, their initial distributions can be
specified by the 3-tuples (i,k,1) for array A and (k,j,1) for array B. Then, the
elements of these two arrays are broadcast to all other processors in such a way that
processor P_{i,j,k} will have A[i,k] and B[k,j]. This intermediate data distribution is
described by (i,j,k), which is the same as (i,*,k) for array A and (*,j,k) for array
B. The costs of redistributing A and B are denoted as MoveCost(A, (i,k,1), (i,j,k))
and MoveCost(B, (k,j,1), (i,j,k)) respectively.
Now that both arrays have the same distribution, the multiplication step can be
carried out under the same loop distribution (i,j,k). The intermediate result f1[i,j,k]
represents the product now stored in P_{i,j,k} and hence has the same data distribution
(i,j,k). The computational cost in forming f1 is denoted as CalcCost(f1, (i,j,k)).
Finally, the sums C[i,j] for all i and j are formed by single-node accumulation
of the products along the third processor dimension, and C[i,j] ends up in P_{i,j,1}.
The loop distribution for the summation step is again (i,j,k), but the data dis-
tribution of C is (i,j,1). In this step, the computational and communication costs
are CalcCost1(f2, (i,j,k)) and MoveCost2(f2, (i,j,k), (i,j,1)) respectively. Therefore,
the total execution time can be expressed as:

MoveCost(A, (i,k,1), (i,j,k)) + MoveCost(B, (k,j,1), (i,j,k)) + CalcCost(f1, (i,j,k)) + CalcCost1(f2, (i,j,k)) + MoveCost2(f2, (i,j,k), (i,j,1))
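The two total-cost expressions can be assembled structurally, with MoveCost and CalcCost left abstract (in the dissertation they come from empirical measurements; the Python function names below are stand-ins for illustration only):

```python
# Structural sketch of the total execution-time expressions for the simple
# 2-D algorithm and the DNS 3-D algorithm. The cost functions are passed
# in, mirroring how the model plugs in measured Mcost/CalcCost values.
def simple_2d_cost(move_cost, calc_cost, calc_cost1):
    return (move_cost('A', ('i', 'k'), ('i', 'j')) +
            move_cost('B', ('k', 'j'), ('i', 'j')) +
            calc_cost('f1', ('i', 'j')) +
            calc_cost1('f2', ('i', 'j')))

def dns_3d_cost(move_cost, calc_cost, calc_cost1, move_cost2):
    return (move_cost('A', ('i', 'k', '1'), ('i', 'j', 'k')) +
            move_cost('B', ('k', 'j', '1'), ('i', 'j', 'k')) +
            calc_cost('f1', ('i', 'j', 'k')) +
            calc_cost1('f2', ('i', 'j', 'k')) +
            move_cost2('f2', ('i', 'j', 'k'), ('i', 'j', '1')))

# With every cost term set to 1, the expressions simply count their terms.
unit = lambda *args: 1.0
print(simple_2d_cost(unit, unit, unit))        # 4.0 (four cost terms)
print(dns_3d_cost(unit, unit, unit, unit))     # 5.0 (five cost terms)
```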
4.3 A Dynamic Programming Algorithm
We assume the input arrays can be distributed initially among the processors
in any way at zero cost, as long as they are not replicated. We do not require the
final results to be distributed in any particular way. Our approach works regardless of
whether any initial or final data distribution is given. If all data arrays and loop nests
have the same distribution n-tuple (for an n-dimensional processor array), no data
movement among the processors will be required during execution. This is achievable
if and only if there exist n indices that appear in the index sets of every data array
and every loop nest. When this condition cannot be satisfied, we need to determine
the combination of the distribution n-tuples for the data arrays and the loop nests
that minimizes the computational and communication costs.
A dynamic programming algorithm that determines the distribution of dense ar-
rays to minimize the computational and communication costs is given below.

1. Transform the given sequence of formulae into an expression tree T (see Sec-
tion 2.2).

2. Let Cost(v, α) be the minimal total cost for the subtree rooted at v with distri-
bution α. Initialize Cost(v, α) for each leaf node v in T and each distribution
α as follows:

$$Cost(v, \alpha) = \begin{cases} 0 & \text{if } NoReplicate(\alpha) \\ \min_{NoReplicate(\beta)} \{ MoveCost(v, \beta, \alpha) \} & \text{otherwise} \end{cases}$$

where NoReplicate(α) is a predicate meaning α involves no replication.
3. Perform a bottom-up traversal of T. For each internal node u and each distri-
bution α, calculate Cost(u, α) as follows:
Case (a): u is a multiplication node with two children v and v'. We need both
v and v' to have the same distribution, say γ, before u can be formed. After
the multiplication, the product could be redistributed if necessary. Thus,

$$Cost(u, \alpha) = \min_\gamma \{ Cost(v, \gamma) + Cost(v', \gamma) + CalcCost(u, \gamma) + MoveCost(u, \gamma, \alpha) \}$$
Case (b): u is a summation node over index i and with a child v. v may have
any distribution γ. If i ∈ γ, each processor first forms partial sums of u and
then we either combine the partial sums on one processor along the i dimension
or replicate them on all processors along that processor dimension. Afterwards,
the sum could be redistributed if necessary. Thus,

$$Cost(u, \alpha) = \min_\gamma \{ Cost(v, \gamma) + \min( CalcCost1(u, \gamma) + MoveCost1(u, \gamma, \alpha),\; CalcCost2(u, \gamma) + MoveCost2(u, \gamma, \alpha) ) \}$$
In either case, save into Dist(u, α) the distribution γ that minimizes Cost(u, α).

4. When step 3 finishes for all nodes and all indices, the minimal total cost for
the entire tree is min_α { Cost(T.root, α) }. The distribution α that minimizes the
total cost is the optimal distribution for T.root. The optimal distributions for
other nodes can be obtained by tracing back Dist(u, α) in a top-down manner,
starting from Dist(T.root, α).
The running time complexity of this algorithm is O(q^2 |T|), where |T| is the number
of internal nodes in the expression tree, q = O(m^n) is the number of different
possible distribution n-tuples, and m is the number of index variables. The storage
requirement for Cost(u, α) and Dist(u, α) is O(q |T|).
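A condensed sketch of the algorithm for multiplication nodes (the summation case is analogous, using CalcCost1/2 and MoveCost1/2; the Node class, distribution set, and cost functions below are simplified stand-ins, not the dissertation's implementation):

```python
# Bottom-up dynamic programming over an expression tree: for each node and
# each distribution alpha, minimize over the shared child distribution gamma.
import math

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def optimal_cost(root, dists, calc_cost, move_cost, no_replicate):
    cost, dist = {}, {}            # cost[(v, a)] and Dist[(v, a)] as in the text

    def solve(v):
        if not v.children:         # leaf: free initial non-replicated layout
            for a in dists:
                cost[v, a] = (0.0 if no_replicate(a) else
                              min(move_cost(v, b, a)
                                  for b in dists if no_replicate(b)))
            return
        for c in v.children:
            solve(c)
        for a in dists:
            cost[v, a], dist[v, a] = math.inf, None
            for g in dists:        # children share distribution g before v forms
                c = (sum(cost[ch, g] for ch in v.children) +
                     calc_cost(v, g) + move_cost(v, g, a))
                if c < cost[v, a]:
                    cost[v, a], dist[v, a] = c, g

    solve(root)
    return min(cost[root, a] for a in dists)

# Toy instance: one multiplication node over two leaves, two distributions.
tree = Node('f1', [Node('A'), Node('B')])
best = optimal_cost(tree, ['(i,j)', '(k,j)'],
                    calc_cost=lambda v, g: 1.0,
                    move_cost=lambda v, g, a: 0.0 if g == a else 1.0,
                    no_replicate=lambda a: True)
print(best)  # 1.0
```

The q^2 factor in the running time is visible as the nested loops over a and g at each internal node.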
4.4 An Example
The above algorithm for dense arrays has been implemented. As an illustration,
we apply it to a triple matrix multiplication problem specified by the following input
file to the program:
f1[i,j,k] = A[i,j] * B[j,k]
f2[i,k] = sum j f1[i,j,k]
f3[k,l,m] = C[k,l] * D[l,m]
f4[k,m] = sum l f3[k,l,m]
f5[i,k,m] = f2[i,k] * f4[k,m]
f6[i,m] = sum k f5[i,k,m]
i 200
j 96000
k 400
l 84000
m 200
end
The first six lines specify the sequence of formulae. The next five lines provide
the ranges of the index variables. Note that the matrices are rectangular. The target
parallel machine is a Cray T3E at the Ohio Supercomputer Center. We empirically
measured the processor speed for the computation kernel (found to be about 400
Mflops) and Mcost (which is used to calculate MoveCost) for each possible pair of initial
and final distributions for several different message sizes. These measurements are
given as auxiliary input to the program. Eight processors viewed as a logical two-
dimensional 2 × 4 array are specified.
The optimal distribution of the arrays that minimizes the total computational
and communication time as generated by the program is shown in Table 4.1. The
appearance of two n-tuples under the distribution column indicates redistribution of
the array. CalcCost and MoveCost are expressed in seconds. For f2 and f4, the partial
sums are not replicated.
Array        Size          Distribution (γ → α)   CalcCost(u, γ)   MoveCost(u, γ, α)
A[i,j]       1.92 × 10^7   (i,j) → (*,j)          0.000            0.793
B[j,k]       3.84 × 10^7   (k,j)                  0.000            0.000
C[k,l]       3.36 × 10^7   (k,l)                  0.000            0.000
D[l,m]       1.68 × 10^7   (m,l) → (*,l)          0.000            0.694
f1[i,j,k]    7.68 × 10^9   (k,j)                  2.400            0.000
f2[i,k]      8.00 × 10^4   (k,j) → (i,*)          2.400            0.024
f3[k,l,m]    6.72 × 10^9   (k,l)                  2.100            0.000
f4[k,m]      8.00 × 10^4   (k,l) → (*,m)          2.100            0.024
f5[i,k,m]    1.60 × 10^7   (i,m)                  0.005            0.000
f6[i,m]      4.00 × 10^4   (i,m)                  0.005            0.000
Total time                                        9.010            1.535
Table 4.1: Optimal array distributions for an example formula sequence.
4.5 Common Sub-Expressions
With the introduction of common sub-expressions, a formula sequence must be
represented as a directed acyclic graph (DAG) instead of an expression tree, since each
common sub-expression appears as an operand in more than one subsequent formula
and has multiple parents. If the algorithm for a tree (in Section 4.3) is applied on
a DAG, problems such as cost double-counting, distribution mismatch, and wrong
solution may arise due to the multi-parent nodes. We want to be able to compute
each intermediate array only once and use it multiple times as needed (possibly with
multiple redistributions).
Let v be a multi-parent node. We propose the following changes to the algorithm.
• To each parent of v, we add an extra MoveCost term for the redistribution of
v.
• The minimization over the distribution of v is not performed at its parents,
but rather at the dominator node of v, denoted DOM(v) (which is the closest
ancestor of v that every path from v to the root must pass through).
• For each node v' on any path from v to DOM(v), we keep a separate Cost(v', α)
for each possible distribution of v.
To illustrate the changes, consider the following formula sequence in which f1 has
2 parents.

f1[i,j,k] = A[i,j] × B[j,k]
f2[i,j,k] = f1[i,j,k] × C[i,k]
f3[i,j,k] = f1[j,k,i] × f2[i,j,k]

The equations for finding the lowest cost are:

Cost(f1, α) = min_γ { Cost(A, γ) + Cost(B, γ) + CalcCost(f1, γ) + MoveCost(f1, γ, α) }

Cost(f2, θ)|_(f1,α) = min_β { Cost(f1, α) + MoveCost(f1, α, β) + Cost(C, β) + CalcCost(f2, β) + MoveCost(f2, β, θ) }

Cost(f3, φ) = min_θ { min_α { Cost(f1, α) + MoveCost(f1, α, θ) + Cost(f2, θ)|_(f1,α) } + CalcCost(f3, θ) + MoveCost(f3, θ, φ) }
Note that the complexity of the revised algorithm is now O(q^{2+t} |T|), where |T| is
the number of nodes in the DAG, q is the number of different possible distributions,
and t is the number of multi-parent nodes that are 'open' at a time.
A DAG also leads to another complication called the Steiner tree effect [66], i.e.,
the distributions of the parent nodes of a multi-parent node v may be obtained with
a lower cost by including more 'transit' nodes between v and its parents. The revised
algorithm has taken care of this effect for 2-parent nodes. To obtain optimal solutions
for nodes with more than 2 parents, more Steiner trees with more MoveCosts have to
be considered.
4.6 Sparse Arrays
A sparse array is said to be evenly distributed among the processors if an equal
number of array elements is assigned to each processor. We do not consider un-
even distribution of sparse arrays as it would lead to load imbalance and probably
sub-optimal performance. With the uniform sparsity assumption, a sparse array is
guaranteed to be evenly distributed if no two distributed array dimensions appear in
the same sparsity entry or are reachable from each other in the sparsity graph (see
Section 2.7). As an example, if the array v[i,j,k,l] has sparsity entries (i,j,0.1) and
(j,k,0.1), then at most one of the 3 indices i, j and k may be distributed; otherwise,
uneven distribution would result.
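Under the uniform sparsity assumption, this test amounts to checking that no two distributed dimensions lie in the same connected component of the sparsity graph. A sketch using union-find (the data structures here are hypothetical, for illustration):

```python
# Sketch: a distribution is guaranteed even if no two distributed
# dimensions are reachable from each other in the sparsity graph.
def evenly_distributable(distributed_dims, sparsity_edges):
    # Union-find over the sparsity graph's connected components.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in sparsity_edges:
        parent[find(a)] = find(b)
    roots = [find(d) for d in distributed_dims]
    return len(roots) == len(set(roots))   # all in distinct components?

# v[i,j,k,l] with sparsity entries (i,j,0.1) and (j,k,0.1): i, j, and k are
# mutually reachable, so at most one of them may be distributed.
print(evenly_distributable({'i'}, [('i', 'j'), ('j', 'k')]))       # True
print(evenly_distributable({'i', 'k'}, [('i', 'j'), ('j', 'k')]))  # False
```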
Since zero elements in sparse arrays do not participate in computation or data
movement, the array size component of the computational or communication costs
for sparse arrays equals the number of non-zero elements. In other words,

CalcCost(v, α) = CalcCost(v', α)
MoveCost(v, α, β) = MoveCost(v', α, β)

where v' is a dense array which has the same number of non-zero elements as v. These
formulae for CalcCost and MoveCost of sparse arrays are exact unless the indices
assigned to a processor before and after redistribution of v are mutually reachable in
the sparsity graph, in which case MoveCost would be an approximation.
4.7 Fast Fourier Transform
Since exponential functions are computed on the fly (by FFT routines), they are
neither stored as arrays nor moved between processors and the costs of forming them
are usually absorbed into the FFT costs. Thus, we can simplify a DAG by replacing
a multiplication node whose children are exponential functions by an exponential
function leaf node.
An FFT formula introduces into a DAG a new kind of node called an FFT node,
which has the summation index as its label and the operand array and an exponential
function as its two children. Let u be an FFT node with summation index i and
operand arrays v[K,i] and exp[i,j], and let γ be the distribution of v. The minimal
total cost for the DAG rooted at u with distribution α is evaluated as follows. If
i ∉ γ, each processor independently performs serial FFTs on its local portion of v;
otherwise, group(s) of processors perform parallel FFTs collectively on their portions
of v. Afterwards, u may be redistributed if necessary. Hence,
$$Cost(u, \alpha) = \begin{cases} \min_\gamma \{ Cost(v, \gamma) + CalcCost3(u, \gamma) + MoveCost(u, \gamma, \alpha) \} & \text{if } i \notin \gamma \\ \min_\gamma \{ Cost(v, \gamma) + CalcCost4(u, \gamma) + MoveCost4(u, \gamma) + MoveCost(u, \gamma', \alpha) \} & \text{otherwise} \end{cases}$$
where CalcCost3 is the computational cost for forming the serial FFTs, CalcCost4
and MoveCost4 are the computational and communication costs for forming the par-
allel FFTs, and γ' is γ with i replaced by j.
4.8 Communication Minimization with Memory Constraint
Given a sequence of formulae, we now address the problem of finding the optimal
partitioning of arrays and operations among the processors and the loop fusions on
each processor in order to minimize inter-processor communication and computational
costs while staying within the available memory in implementing the computation on
a message-passing parallel computer. Section 4.8.1 discusses the combined effects of
loop fusions and array/operation partitioning on communication cost, computational
cost, and memory usage. Two approaches for solving this problem are presented in
Section 4.8.2.
4.8.1 Preliminaries
The partitioning of data arrays among the processors and the allowable fusions
of loops on each processor are inter-related. For the fusion of a t-loop between nodes
u and v to be possible, that loop must either be undistributed at both u and v, or
be distributed onto the same number of processors at u and at v (but not necessarily
along the same processor dimension). Otherwise, the range of the t-loop at node
u would be different from that at node v. As an example, suppose 128 processors
form a 2 × 2 × 4 × 8 array. Consider the expression tree shown in Figure 3.1(c) and
reproduced in Figure 4.3. If array B[j,k,l] has distribution (j,k,*,1) and fusion (jl)
with f2[j,k,l], then f2 can have distribution (1,j,*,k) but not (k,l,j,1) because the
fusion (jl) forces the j-dimension of f2 to be distributed onto 2 processors and the
l-dimension to be undistributed. Array partitioning and loop fusion also have effects
on memory usage, communication cost, and computational cost.
Fusing an l-loop between a node v and its parent eliminates the l-dimension of
array v. If the l-loop is not fused but the l-dimension of array v is distributed along
the d-th processor dimension, then the range of the l-dimension of array v on each
processor is reduced to N_l/p_d. Let DistSize(v, α, f) be the size on each processor of
Figure 4.3: The expression tree in Figure 3.1(c).
array v, which has fusion f with its parent and distribution α. We have

DistSize(v, α, f) = ∏_{i ∈ v.dimens} DistRange(i, v, α, Set(f))

where v.dimens = v.indices − {v.sumindex} is the set of array dimension indices of v before
loop fusions, and

DistRange(i, v, α, x) = { 1          if i ∈ x
                        { N_i / p_d  if i ∉ x and i = α[d]
                        { N_i        if i ∉ x and i ∉ α
In our example, assume again that N_i = 500, N_j = 100, N_k = 40, and N_l = 15. If the
array B[j, k, l] has distribution (j, k, *, *) and fusion (jl) with f2, then the size of B on
each processor would be N_k/2 = 20 since the k-dimension is the only unfused dimension
and is distributed onto 2 processors. Note that if array v undergoes redistribution
from α to β, the array size on each processor after redistribution is DistSize(v, β, f),
which could be different from DistSize(v, α, f), the size before redistribution.
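The DistSize and DistRange definitions above can be sketched in code. This is an illustrative rendering, not the dissertation's implementation: the representation of distributions as tuples, fusions as sets, and N and p as a dictionary and list are all assumptions made here for concreteness.

```python
from math import prod

# Illustrative sketch (not the dissertation's code) of DistSize/DistRange.
# A distribution alpha is a tuple whose d-th entry is the index distributed
# along the d-th processor dimension ('*' = none); `fused` is Set(f),
# N maps each index to its range, and p[d] is the processor-array extent.

def dist_range(i, alpha, fused, N, p):
    """Per-processor extent contributed by dimension i of the array."""
    if i in fused:                       # loop fused with parent: dimension gone
        return 1
    if i in alpha:                       # distributed along dimension d
        return N[i] // p[alpha.index(i)]
    return N[i]                          # neither fused nor distributed

def dist_size(dimens, alpha, fused, N, p):
    """DistSize(v, alpha, f) = product of DistRange over v.dimens."""
    return prod(dist_range(i, alpha, fused, N, p) for i in dimens)

# The text's example: B[j,k,l] on a 2x2x4x8 processor array,
# distribution (j,k,*,*), fusion (jl) with its parent.
N = {'j': 100, 'k': 40, 'l': 15}
p = [2, 2, 4, 8]
print(dist_size(['j', 'k', 'l'], ('j', 'k', '*', '*'), {'j', 'l'}, N, p))  # -> 20
```

The fused j and l dimensions contribute a factor of 1 each, leaving only the k-dimension's 40/2 = 20 elements per processor, matching the text.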
The initial and final distributions of an array v determine the communication
pattern and whether v needs redistribution, while loop fusions change the number
of times array v is redistributed and the size of each message. Let v be an array
that needs to be redistributed. If node v is not fused with its parent, array v is
redistributed only once. Fusing a t-loop between node v and its parent puts the
collective communication code for redistribution inside the t-loop. Thus, the number
of redistributions is increased by a factor of N_t/p_d if the t-dimension of v is distributed
along the d-th processor dimension and by a factor of N_t if the t-dimension of v is
not distributed. In other words, loop fusions cannot reduce communication cost.
Continuing with our example, if the array B[j, k, l] has fusion (jl) with f2 and needs
to be redistributed from (j, k, *, *) to (k, j, *, *), then it would be redistributed N_j/2 ×
N_l = 750 times.
Let Mcost(localsize, α, β) be the communication cost of moving the elements of
an array, with localsize elements distributed on each processor, from an initial distribution
α to a final distribution β. We empirically measure Mcost for each possible
non-matching pair of α and β and for several different localsizes on the target parallel
computer. Let MoveCost(v, α, β, f) denote the communication cost of redistributing
the elements of array v, which has fusion f with its parent, from an initial distribution
α to a final distribution β. It can be expressed as:

MoveCost(v, α, β, f) = MsgFactor(v, α, Set(f)) × Mcost(DistSize(v, α, Set(f)), α, β)

where

MsgFactor(v, α, x) = ∏_{i ∈ v.dimens} LoopRange(i, v, α, x)

LoopRange(i, v, α, x) = { 1          if i ∉ x
                        { N_i / p_d  if i ∈ x and i = α[d]
                        { N_i        if i ∈ x and i ∉ α
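The MoveCost decomposition above admits a similar sketch. The data-structure choices are assumptions, and Mcost is passed in as a stub, since the dissertation obtains it by empirical measurement on the target machine rather than by formula.

```python
from math import prod

# Companion sketch for MoveCost; representation choices are assumptions,
# and mcost is a caller-supplied stub standing in for the empirically
# measured Mcost(localsize, alpha, beta).

def loop_range(i, alpha, fused, N, p):
    """Repetition factor contributed by dimension i (the dual of DistRange:
    only FUSED loops multiply the number of redistributions)."""
    if i not in fused:
        return 1                          # not fused: redistributed once
    if i in alpha:                        # fused and distributed along dim d
        return N[i] // p[alpha.index(i)]
    return N[i]                           # fused and undistributed

def msg_factor(dimens, alpha, fused, N, p):
    """MsgFactor(v, alpha, x) = product of LoopRange over v.dimens."""
    return prod(loop_range(i, alpha, fused, N, p) for i in dimens)

def move_cost(dimens, alpha, beta, fused, N, p, mcost, local_size):
    """MoveCost(v, alpha, beta, f) = MsgFactor * Mcost(DistSize, alpha, beta)."""
    return msg_factor(dimens, alpha, fused, N, p) * mcost(local_size, alpha, beta)

# The text's example: B[j,k,l] with fusion (jl), initial distribution
# (j,k,*,*) on a 2x2x4x8 processor array -> (100/2) * 15 = 750 messages.
N = {'j': 100, 'k': 40, 'l': 15}
p = [2, 2, 4, 8]
print(msg_factor(['j', 'k', 'l'], ('j', 'k', '*', '*'), {'j', 'l'}, N, p))  # -> 750
```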
Note that the computational cost CalcCost is unaffected by loop fusions (see Section 4.1).
4.8.2 Two Approaches
In Section 4.3, we have solved the communication minimization problem without
considering loop fusion or memory usage. In practice, the arrays involved are
often too large to fit into the available memory even after partitioning among the
processors. We now present two approaches to extending the framework developed
in Sections 3.1 to 3.5 to the parallel computing context to solve the communication
minimization problem with memory constraint. We assume the input arrays can be
distributed initially among the processors in any way at zero cost, as long as they are
not replicated.
Our first approach is to find an optimal loop fusion configuration from among
all the communication-optimal distribution configurations so that the memory usage
is below the limit but without increasing the communication cost. It decouples the
communication minimization problem and the memory usage minimization problem.
This approach has two phases. The first phase is to determine all distribution
configurations that minimize the computational and communication costs. A dynamic
programming algorithm for this purpose is given below. It assumes that no loops are
fused. The algorithm is basically the same as the one in Section 4.3 except that we
now keep the set of all optimal solutions instead of one optimal solution for each node
in the expression tree.
1. Transform the given sequence of formulae into an expression tree T (see Section 2.3).
2. Let Cost(v, α) be the minimal total cost for the subtree rooted at v with distribution
α. Initialize Cost(v, α) for each leaf node v in T and each distribution
α as follows:

Cost(v, α) = { 0                                                if NoReplicate(α)
             { min_{β : NoReplicate(β)} MoveCost(v, β, α, ∅)    otherwise

where NoReplicate(α) is a predicate meaning that α involves no replication.
3. Perform a bottom-up traversal of T. For each internal node u and each distribution
α, calculate Cost(u, α) as follows:

Case (a): u is a multiplication node with two children v and v'. We need both
v and v' to have the same distribution, say γ, before u can be formed. After
the multiplication, the product could be redistributed if necessary. Thus,

Cost(u, α) = min_γ { Cost(v, γ) + Cost(v', γ) + CalcCost(u, γ) + MoveCost(u, γ, α, ∅) }

Case (b): u is a summation node over index i and with a child v, which may
have any distribution γ. If i ∈ γ, each processor first forms partial sums of
u, and then we either combine the partial sums on one processor along the i
dimension or replicate them on all processors along that processor dimension.
Afterwards, the sum could be redistributed if necessary. Thus,

Cost(u, α) = min_γ { Cost(v, γ) + min( CalcCost1(u, γ) + MoveCost1(u, γ, α, ∅), CalcCost2(u, γ) + MoveCost2(u, γ, α, ∅) ) }

In either case, save into DistSet(u, α) the set of distributions γ that minimize
Cost(u, α).
4. When step 3 finishes for all nodes and all indices, the minimal total cost for the
entire expression tree T is min_α { Cost(T.root, α) }. The distributions α that
minimize the total cost for T are the optimal distributions for T.root. The optimal
distributions for other nodes can be obtained by tracing back DistSet(u, α) in
a top-down manner, starting from DistSet(T.root, α) for each α.
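The four steps above can be condensed into a schematic dynamic program. This toy version is an assumption-laden sketch: the Node class, the cost lambdas, the two test distributions, and the zero-cost leaf placement all stand in for the chapter's actual machinery, and only the multiplication case (a) is shown.

```python
# A toy, self-contained rendering of the phase-1 dynamic program; every
# name here is a hypothetical stand-in for the chapter's structures.

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def postorder(u):
    for c in u.children:
        yield from postorder(c)
    yield u

def phase1(root, dists, move, calc):
    """cost[u][a]: minimal cost of the subtree at u, ending in distribution a.
    dist_set[u][a]: ALL child distributions gamma achieving that minimum."""
    cost, dist_set = {}, {}
    for u in postorder(root):               # bottom-up traversal
        cost[u], dist_set[u] = {}, {}
        for a in dists:
            if not u.children:
                cost[u][a] = 0              # inputs placed freely at zero cost
                continue
            # children must share a common distribution gamma (case (a))
            by_gamma = {g: sum(cost[c][g] for c in u.children)
                           + calc(u, g) + move(u, g, a)
                        for g in dists}
            best = min(by_gamma.values())
            cost[u][a] = best
            dist_set[u][a] = {g for g, v in by_gamma.items() if v == best}
    return cost, dist_set

# Two leaves multiplied at the root; redistribution costs 1, computing costs 1.
root = Node('mul', [Node('A'), Node('B')])
cost, dist_set = phase1(root, ['d0', 'd1'],
                        move=lambda u, g, a: 0 if g == a else 1,
                        calc=lambda u, g: 1)
print(min(cost[root].values()))  # -> 1
```

Keeping the whole minimizing set in dist_set, rather than one witness, is exactly what lets the second phase enumerate every communication-optimal configuration by a top-down traceback.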
The running time complexity of this algorithm is O(q²|T|), where |T| is the number
of nodes in the expression tree, q = O(m^n) is the number of different possible distribution
n-tuples, and m is the number of index variables. The storage requirement
for Cost(u, α) is O(q|T|), and for DistSet(u, α) is O(q²|T|).
The outcome of the first phase is a set of array distribution configurations that
have the same minimal total communication and computational cost. These optimal
distribution configurations are then fed into the second phase to find a loop
fusion configuration that uses no more than the available memory on each processor
without increasing the communication cost. One of the memory-optimal loop fusion
algorithms in Sections 3.3 and 3.5 (depending on the memory allocation model) can
be used in the second phase, but with the following changes:
1. To ensure the communication cost stays minimal, if an array v is redistributed,
then v must not be fused with its parent. Arrays that are not redistributed can
be fused freely with their parents since they do not contribute to any communication
cost. The InitFusible procedure can be easily modified for this.
2. In calculating the array sizes on each processor, the DistSize function in Section
4.8.1 is used in place of the FusedSize function so that the effect of distribution
on array sizes is accounted for.
3. If the size of an array before and after redistribution is different, the higher of
the two should be used in determining memory usage.
Since the goal is to stay within the available memory rather than to minimize
memory usage, we do not need to apply the algorithm on all the optimal distribution
configurations. We can stop as soon as we find that the memory usage of an optimal
loop fusion for a distribution configuration is below the limit. In case the number
of optimal distribution configurations is large, the following pruning methods can be
applied.
1. Any distribution configuration with a redistributed array whose size is larger
than the available memory can be pruned.
2. For static memory allocation, any distribution configuration in which the total
size of all redistributed arrays is above the memory limit can be pruned.
3. If several distribution configurations have the same set of redistributed arrays,
all but one of them can be pruned. This is because they have the same set of
fusible loops and hence the same minimal memory usage.
4. Instead of processing one distribution configuration at a time, apply the algorithm
on the optimal distributions for each array v and prune the distributions
for the children of v that have inferior memory usage.
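Pruning rules 1-3 above can be sketched as a single filter pass. The representation here is a hypothetical simplification: a configuration is reduced to a map from each redistributed array to its per-processor size, which is all that these three rules inspect.

```python
# Hypothetical sketch of pruning rules 1-3; a configuration is modeled
# as {redistributed array name: per-processor size}.

def prune_configs(configs, mem_limit, static_alloc=False):
    kept, seen_sets = [], set()
    for redist in configs:
        # Rule 1: a single redistributed array exceeding available memory.
        if any(size > mem_limit for size in redist.values()):
            continue
        # Rule 2 (static allocation): total size of redistributed arrays.
        if static_alloc and sum(redist.values()) > mem_limit:
            continue
        # Rule 3: same set of redistributed arrays => same fusible loops,
        # hence the same minimal memory usage; keep one representative.
        key = frozenset(redist)
        if key in seen_sets:
            continue
        seen_sets.add(key)
        kept.append(redist)
    return kept

configs = [{'T1': 50, 'T2': 30},   # kept
           {'T1': 200},            # pruned by rule 1
           {'T1': 50, 'T2': 40},   # pruned by rule 3 (same {T1, T2} set)
           {'T1': 60, 'T2': 60}]   # pruned by rule 2 under static allocation
print(prune_configs(configs, mem_limit=100, static_alloc=True))
```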
The advantage of the first approach is its efficiency as it only searches among
the optimal distribution configurations. If it finds a memory-optimal loop fusion for
an optimal distribution configuration that uses no more memory than available, the
solution must be optimal because no other solution could possibly use less memory
than an optimal distribution configuration. But if none of the optimal distribution
configurations can be fused to stay within the memory limit, the first approach finds
no solution.
The second approach is to search among all combinations of loop fusions and array
distributions to find one that has minimal total communication and computational
cost and uses no more than the available memory. It is also a bottom-up, dynamic
programming algorithm. At each node v in the expression tree T, we consider all legal
combinations of array distributions for v and loop fusions between v and its parent. A
combination is illegal if there exists a t-loop fused between v and its parent and the
t-dimension of the array v is distributed onto a different number of processors before
and after redistribution. The array size, communication cost, and computational
cost are determined according to the equations in Section 4.8.1. At each node v,
a set of solutions is formed. Each solution contains the final distribution of v, the
loop nesting at v, the loop fusion between v and its parent, the total communication
and computational cost, and the memory usage for the subtree rooted at v (which
is a mem field for static allocation, or a seqset structure for dynamic allocation). A
solution s is said to be inferior to another solution s' if they have the same final
distribution, s.nesting ⊆ s'.nesting, s.totalcost > s'.totalcost, and the memory usage
of s is inferior to s' (as defined in Sections 3.3 and 3.5). An inferior solution and any
solution that uses more memory than available can be pruned. At the root node of
T, the solution(s) with the lowest total cost is the optimal solution for the entire tree.
This approach is exhaustive. It always finds an optimal solution if there is one.
The algorithm can be easily modified to minimize memory usage as well.
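The inferiority test and the pruning it enables might look as follows. The Solution fields here are simplified stand-ins: in particular, memory usage is collapsed to a single number rather than a mem field or seqset structure, so the comparison is only a sketch of the criterion defined above.

```python
from dataclasses import dataclass

# Simplified dominance test for the second approach's solution sets;
# all field representations are assumptions made for illustration.

@dataclass
class Solution:
    final_dist: tuple     # final distribution of v
    nesting: frozenset    # loop nesting at v
    totalcost: int        # communication + computational cost
    mem: int              # memory usage of the subtree (simplified to a number)

def inferior(s, t):
    """s is inferior to t: same final distribution, no extra fusion
    opportunities, strictly higher cost, and no better memory usage."""
    return (s.final_dist == t.final_dist
            and s.nesting <= t.nesting
            and s.totalcost > t.totalcost
            and s.mem >= t.mem)

def prune(solutions, mem_limit):
    """Drop solutions over the memory limit, then drop dominated ones."""
    feasible = [s for s in solutions if s.mem <= mem_limit]
    return [s for s in feasible
            if not any(inferior(s, t) for t in feasible if t is not s)]

a = Solution(('j', '*'), frozenset({'i'}), totalcost=10, mem=5)
b = Solution(('j', '*'), frozenset({'i', 'j'}), totalcost=8, mem=4)
print(prune([a, b], mem_limit=100) == [b])  # -> True
```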
In comparison, the first approach is more efficient but is not guaranteed to find
a solution while the second approach has such a guarantee but is slower. Hence,
we suggest that the first approach should be applied first. Only if it cannot find a
solution do we use the second approach.
CHAPTER 5
CONCLUSIONS
In this dissertation, we have addressed performance optimization issues for the
parallel execution of a class of nested loops that arise in the context of some computational
physics applications modeling electronic properties of semiconductors and
metals. The computation essentially involves a multi-dimensional summation (or discretized
integral) of a product of a number of array terms. Besides the typical parallel
computing considerations of mapping the data and computations onto the processors,
two other performance aspects also require attention. One aspect is the total
number of arithmetic operations, which can be reduced by judiciously restructuring
the computations through use of the algebraic properties of commutativity, associativity,
and distributivity. The other aspect is memory usage, which is a function of loop
fusion and loop reordering.
We have proved that the problem of finding a sequence of nested loops that computes
a given multi-dimensional summation using a minimum number of arithmetic
operations is NP-complete. An efficient pruning search strategy to determine the optimal
restructuring of the computations has been provided. We have implemented the
operation minimization algorithm and have used it to obtain significant improvement
in operation count for self-energy electronic structure calculations in a tight-binding
scheme.
We have considered the problem of seeking a loop fusion configuration that minimizes
memory usage. Based on a framework that models loop fusions and memory
usage, we have presented algorithms that solve the memory usage minimization problem
under both static and dynamic memory allocation. Several
ways to further reduce memory usage at the cost of a higher number of arithmetic
operations were described.
The optimal mapping of the data and computations to minimize total inter-processor
communication and computational costs has been addressed. We have
proposed a framework for describing the partitioning of arrays among processors
and for analyzing the amount of data movement between processors under a multi-dimensional
processor view. A dynamic programming algorithm for finding an optimal
partitioning of data and operations among processors was given. It was shown
that different parallel algorithms for matrix multiplication correspond to instances
that would automatically be evaluated in the proposed framework. We have also
described two approaches for finding a loop fusion and array distribution configuration
that minimizes communication and computational cost while staying within the
available memory.
In practice, some arrays could be sparse, and there is also an opportunity to reuse
common sub-expressions in computing a multi-dimensional summation. Moreover,
some sub-computations in summations involving exponential functions can be computed
more efficiently with fast Fourier transforms. These three characteristics were
addressed in the dissertation, and generalizations to the algorithms for the operation
minimization problem, the communication minimization problem, and the memory
usage minimization problem were proposed.
5.1 Research Topics for Further Pursuit
Building upon the foundations developed in this dissertation, research topics that
may be pursued in the context of optimizing these multi-dimensional summation
calculations include the following.
5.1.1 Generalization of the Class of Nested-Loop Computations Handled
In the form of multi-dimensional summations considered in this dissertation, loop
bounds are assumed to have constant ranges and arrays are assumed to be directly
indexed by a list of distinct loop indices. Relaxation of these restrictions would facilitate
the optimization of a wider class of nested-loop computations but complicates the
optimization problems at the same time. Consider the following multi-dimensional
summation:

S = Σ_{i=1}^{N} Σ_{j=i+1}^{N} ( A[j, i] × B[2i + j, j − i] )
In this example, the loop bounds of j are affine functions of i and the array subscripts
are no longer distinct single loop indices. Since the iteration space is triangular
instead of rectangular, the number of arithmetic operations is reduced by roughly a factor of two.
Furthermore, if array A is generated one element at a time rather than precomputed,
we only need to generate a diagonal plane of A, which is the only portion of A involved
in forming S. These changes could affect the operation count, the memory usage, and
the amount of communication in evaluating the multi-dimensional summation. Such
effects on the cost models and on the existing optimization algorithms need to be
studied.
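As a concrete rendering of the example summation (with placeholder element functions, since the arrays' contents are not specified in the text), the triangular iteration could be coded as:

```python
# Sketch of S = sum_{i=1..N} sum_{j=i+1..N} A[j,i] * B[2i+j, j-i];
# a and b are placeholder element functions standing in for the arrays.

def S(N, a, b):
    """The affine lower bound j = i+1 makes the iteration space triangular
    (roughly half the work of the full N x N rectangle), and only the
    elements of A actually touched ever need to be generated."""
    total = 0
    for i in range(1, N + 1):
        for j in range(i + 1, N + 1):
            total += a(j, i) * b(2 * i + j, j - i)
    return total

# With unit elements the sum just counts iterations: N*(N-1)/2 of them.
print(S(10, lambda j, i: 1, lambda x, y: 1))  # -> 45
```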
5.1.2 Optimization of Cache Performance
Memory access cost can be reduced through loop transformations such as loop
tiling, loop fusion, and loop reordering. Although considerable research on loop transformations
for locality has been reported in the literature [1, 3, 9, 14, 30, 39, 42, 67],
it generally does not consider the need to perform both loop fusion
and loop tiling together for locality and memory usage. Initial work on this problem
is reported in [13]. For a fully associative cache and a restricted class of expression
trees in which each node is equivalent to a matrix multiplication, an algorithm is
presented for finding the tile sizes, loop fusions, and the partitioning of work among
processors that minimize cache misses while keeping memory usage under a given
limit. Further work on generalizing the class of expression trees handled by
the algorithm is desirable.
5.1.3 Optimization of Disk Access Cost
Another performance metrics to be optimized is disk access cost. Some of the
input arrays may be disk-resident and the output array may need to be written to
disk for further analysis or visualization. When the amount of available memory is
tight, we can allocate a buffer for each input array and fill it with a portion of the array
read from the disk on an on-demand basis. Similarly, elements of the output array
can be stored in a buffer and written to disk when the buffer is full. If the optimal
loop fusion configuration still requires more memory than available, in addition to
trading operations for memory (see Section 3.9), we can also trade disk accesses for
memory. This is done by saving intermediate arrays to disk in chunks after their
production and reading them back from disk before their consumption. The buffer
size for writing an array to disk and reading it back may be different. In general,
the number of disk reads/writes for an array is inversely proportional to the size of
the buffer allocated for the array. Thus, this optimization problem can be stated as
follows: find a fused loop structure with disk read/write statements that implements
a given expression tree with the lowest disk access cost while not using more memory
than a given limit.
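The stated inverse relationship between buffer size and disk accesses can be illustrated with a toy cost count; the function and its arguments are hypothetical, introduced here only to make the trade-off concrete.

```python
from math import ceil

# Toy illustration of the disk-access trade-off: an intermediate array of
# `size` elements written through a buffer of `write_buf` elements costs
# ceil(size / write_buf) writes, and likewise for reads, so the access
# count falls inversely with buffer size. The two buffers may differ.

def disk_accesses(size, write_buf, read_buf):
    """Chunked writes after production plus chunked reads before
    consumption of a disk-resident intermediate array."""
    return ceil(size / write_buf) + ceil(size / read_buf)

print(disk_accesses(10_000, write_buf=500, read_buf=250))  # -> 60
```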
Such fused loop structures with disk read/write statements can be represented by
an extended form of fusion graphs (see Section 3.1). First, we extend an expression
tree by replacing each leaf node with a "read input array" node, adding a "write
output array" node as the parent of the root node, and inserting a "write intermediate
array" node and a "read intermediate array" node between every intermediate array and
its parent. Then, we extend a fusion graph by adding, for each "read" or "write" node,
a set of vertices, one for each dimension of the array. In effect, an extended fusion
graph models disk read/write statements inside loop nests and allows us to consider
different possible fusions of the loops around array evaluation statements as well as
disk read/write statements. An algorithm similar to the memory usage minimization
algorithms (see Sections 3.3 and 3.5) can be applied on an extended fusion graph to
find the loop fusions that result in minimal disk access cost under the memory constraint.
To follow this approach, it is necessary to characterize the relationship between loop
fusions, buffer size, and disk access cost, and to propose changes to the existing algorithms
to handle disk accesses.
5.1.4 Development of an Automatic Code Generator
After implementing the optimization algorithms presented in this dissertation,
a logical next step is to develop an automatic code generator that takes a multi-dimensional
summation as input and synthesizes the source code of a parallel program
that computes the desired multi-dimensional summation and is optimal in arithmetic
operation count and communication cost while not exceeding the available memory.
The automatic code generator would first use the operation minimization algorithm
to obtain an operation-minimal formula sequence, and then use the memory usage minimization
algorithm and the communication minimization algorithm to determine the
optimal loop fusions and the partitioning of the arrays and computations among the
processors. Based on the results returned from the optimization algorithms, together
with information about the target machine architecture and the target language, the
source code of a parallel program that computes the given multi-dimensional summation
can be produced automatically. Some issues that need to be addressed for code
generation include:
• Generating a fused loop structure from loop fusion information. Sections 3.3
and 3.5 have provided sufficient information on how a fused loop structure
corresponding to a fusion graph can be formed. Adding the loop bounds and
the array initialization, allocation, and deallocation statements to a fused loop
structure is quite straightforward.
• Specifying in the source code the distribution of arrays and computation among
processors. This is language dependent. In MPI, for example, this can be
achieved by reading or computing a different portion of an array depending on
the process ID. In some other languages, preprocessor directives or pragmas
may be used to specify how the arrays and loop iterations are partitioned.
• Redistributing arrays. For each array to be redistributed, we first have to determine
the communication pattern in terms of which processors need to send
and/or receive data to/from which other processors and, for each source and
destination processor pair, which part of the local array the source processor
needs to send. Some communication patterns can be classified as one or more
groups of processors performing concurrent broadcast or personalized communication,
which could be one-to-all or all-to-all. To implement array redistributions,
collective communication library routines such as those in MPI or PVM
may be used. Efficient algorithms for array redistribution can be found in
[25, 47, 51, 53, 59, 60, 61, 63, 64].
• Computing fast Fourier transforms (FFTs). FFTs can be implemented by calling
library routines such as FFTW. Section 3.8 describes which portions of a
source array and a result array participate in an FFT function call.
BIBLIOGRAPHY
[1] Anant Agarwal, David Kranz, and Venkat Natarajan. "Automatic partitioning of parallel loops for cache-coherent multiprocessors". In International Conference on Parallel Processing, pages 12-111, St. Charles, IL, August 1993.
[2] Anant Agarwal, David Kranz, and Venkat Natarajan. "Automatic partitioning of parallel loops and data arrays for distributed shared memory multiprocessors". IEEE Transactions on Parallel and Distributed Systems, 6(9):946-962, September 1995.
[3] Jennifer M. Anderson, Saman P. Amarasinghe, and Monica S. Lam. "Data and computation transformations for multiprocessors". In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 166-178, Santa Barbara, CA, July 1995.
[4] Jennifer M. Anderson and Monica S. Lam. "Global optimizations for parallelism and locality on scalable parallel machines". In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 112-125, Albuquerque, NM, June 1993.
[5] Andrew W. Appel and Kenneth J. Supowit. "Generalizations of the Sethi-Ullman algorithm for register allocation". Software: Practice and Experience, 17(6):417-421, June 1987.
[6] W. Aulbur. Parallel implementation of quasiparticle calculations of semiconductors and insulators. PhD thesis, The Ohio State University, October 1996.
[7] Rajeev Barua, David Kranz, and Anant Agarwal. "Communication-minimal partitioning of parallel loops and data arrays for cache-coherent distributed-memory multiprocessors". In Languages and Compilers for Parallel Computing, pages 350-368, San Jose, August 1996.
[8] Steve Carr and Ken Kennedy. Compiler blockability of numerical algorithms. Technical Report CRPC-TR92208-S, Rice University, April 1992.
[9] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. "Compiler optimizations for improving data locality". In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, San Jose, CA, October 1994.
[10] Siddhartha Chatterjee, John R. Gilbert, Robert Schreiber, and Shang-Hua Teng. "Automatic array alignment in data-parallel programs". In 20th Annual ACM SIGACT/SIGPLAN Symposium on Principles of Programming Languages, pages 16-28, New York, January 1993.
[11] Siddhartha Chatterjee, John R. Gilbert, Robert Schreiber, and Shang-Hua Teng. "Optimal evaluation of array expressions on massively parallel machines". ACM TOPLAS, 17(1):123-156, January 1995.
[12] Michal Cierniak, Wei Li, and Mohammed Javeed Zaki. "Loop scheduling for heterogeneity". In Fourth International Symposium on High Performance Distributed Computing, August 1995.
[13] Daniel Cociorva, John Wilkins, Chi-Chung Lam, and P. Sadayappan. "Transformations for parallel execution of a class of nested loops on shared-memory multi-processors". Submitted for publication.
[14] Stephanie Coleman and Kathryn S. McKinley. "Tile size selection using cache organization and data layout". In ACM SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
[15] Jack Dongarra, Loïc Prylli, Cyril Randriamaro, and Bernard Tourancheau. "Array Redistribution in ScaLAPACK using PVM". In Second European PVM Users Group Meeting, Lyon, France, September 1995.
[16] Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash. "On estimating and enhancing cache effectiveness". In Fourth International Workshop on Languages and Compilers for Parallel Processing, pages 328-343, Santa Clara, CA, August 1991.
[17] C. N. Fischer and R. J. LeBlanc Jr. Crafting a Compiler. Benjamin/Cummings, Menlo Park, CA, 1991.
[18] Guang R. Gao, Russell Olsen, Vivek Sarkar, and Radhika Thekkath. "Collective loop fusion for array contraction". In Languages and Compilers for Parallel Processing, pages 171-181, New Haven, CT, August 1992.
[19] Guang R. Gao, Vivek Sarkar, and Shaohua Han. "Locality analysis for distributed shared-memory multiprocessors". In Languages and Compilers for Parallel Computing, pages 20-40, San Jose, August 1996.
[20] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, 1979.
[21] John R. Gilbert and Robert Schreiber. "Optimal expression evaluation for data parallel architecture". Journal of Parallel and Distributed Computing, 13:58-64, September 1991.
[22] Leonidas J. Guibas and Douglas K. Wyatt. "Compilation and delayed evaluation in APL". In Fifth Annual ACM Symposium on Principles of Programming Languages, pages 1-8, Tucson, Arizona, January 1978.
[23] Manish Gupta and Prithviraj Banerjee. "PARADIGM: A compiler for automatic data distribution on multicomputers". In ACM International Conference on Supercomputing, Tokyo, Japan, July 1993.
[24] M. S. Hybertsen and S. G. Louie. "Electronic correlation in semiconductors and insulators: band gaps and quasiparticle energies". Phys. Rev. B, 34:5390, 1986.
[25] S. D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan. "Multiphase redistribution: A communication-efficient approach to array redistribution". IEEE Transactions on Parallel and Distributed Systems, 1995.
[26] S. D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan. "Multi-phase redistribution: modeling and evaluation". In International Parallel Processing Symposium, pages 441-445, April 1995.
[27] Wayne Kelly and William Pugh. "A Unifying Framework for Iteration Reordering Transformations". In International Conference on Algorithms and Architectures for Parallel Processing, pages 153-162, Brisbane, Australia, April 1995.
[28] Ken Kennedy and Kathryn S. McKinley. "Optimizing for parallelism and data locality". In ACM International Conference on Supercomputing, pages 323-334, Washington, DC, July 1992.
[29] Ken Kennedy and Kathryn S. McKinley. "Maximizing loop parallelism and improving data locality via loop fusion and distribution". In Languages and Compilers for Parallel Computing, pages 301-320, Portland, OR, August 1993.
[30] Dattatraya Kulkarni and Michael Stumm. Loop and data transformations: a tutorial. Technical Report CSRI-337, University of Toronto, June 1993.
[31] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA, 1994.
[32] Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. Memory-optimal evaluation of expression trees involving large objects. Technical Report OSU-CISRC-5/99-TR13, Dept. of Computer and Information Science, The Ohio State University, May 1999.
[33] Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. "Memory-optimal evaluation of expression trees involving large objects". In International Conference on High Performance Computing, Calcutta, India, December 1999.
[34] Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. "Optimization of memory usage and communication requirements for a class of loops implementing multi-dimensional integrals". In Languages and Compilers for Parallel Computing, San Diego, August 1999.
[35] Chi-Chung Lam, P. Sadayappan, Daniel Cociorva, Mebarek Alouani, and John Wilkins. "Performance optimization of a class of loops involving sums of products of sparse arrays". In Ninth SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, TX, March 1999.
[36] Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimal reordering and mapping of a class of nested-loops for parallel execution". In Languages and Compilers for Parallel Computing, pages 315-329, San Jose, August 1996.
[37] Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "On optimizing a class of multi-dimensional loops with reductions for parallel execution". Parallel Processing Letters, 7(2):157-168, 1997.
[38] Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimization of a class of multi-dimensional integrals on parallel machines". In Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.
[39] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. "The cache performance and optimizations of blocked algorithms". In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, Palo Alto, CA, April 1991.
[40] Wei Li. Compiling for NUMA Parallel Machines. PhD thesis, Cornell University, August 1993.
[41] Wei Li. Compiler optimizations for cache locality and coherence. Technical Report 504, University of Rochester, April 1994.
[42] Wei Li. "Compiler Cache Optimizations for Banded Matrix Problems". In International Conference on Supercomputing, Barcelona, Spain, July 1995.
[43] C. C. Lu and W . C. Chew. "Fast algorithm for solving hybrid integral equations". IEEE Proceedings-H, 140(6):455-460, December 1993.
[44] Naraig Manjikian and Tarek S. Abdelrahman. "Array data layout for the reduction of cache conflicts". In International Conference on Parallel and Distributed Computing Systems, pages 111-118, Orlando, FL, September 1995.
[45] Naraig Manjikian and Tarek S. Abdelrahman. “Fusion of loops for parallelism and locality” . In International Conference on Parallel Processing, pages 11:19-28, Oconomowoc, WI, August 1995.
[46] Naraig Manjikian and Tarek S. Abdelrahman. “Reduction of cache conflicts in loop nests” . Technical Report CSRI-318, University o f Toronto, March 1995.
[47] Philip K. McKinley, Yih-Jia Tsai, and David F. Robinson. A survey of collective communication in wormhole-routed massively parallel computers. Technical Report MSU-CPS-94-35, Michigan State University, June 1994.
[48] Edmund K. Miller. “Solving bigger problems by decreasing the operation count and increasing computation bandwidth”. Proceedings of the IEEE, 79(10):1493-1504, October 1991.
[49] Ikuo Nakata. “On compiling algorithms for arithmetic expressions” . Communications of the Association for Computing Machinery, 10:492-494, 1967.
[50] Miodrag Potkonjak, Mani B. Srivastava, and Anantha P. Chandrakasan. “Multiple constant multiplications: efficient and versatile framework and algorithms for exploring common expression elimination”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(2):151-164, February 1996.
[51] Loïc Prylli and Bernard Tourancheau. Block cyclic array redistribution. Technical Report 95-39, École Normale Supérieure de Lyon, October 1995.
[52] William Pugh. “The Omega test: a fast and practical integer programming algorithm for dependence analysis”. Communications of the ACM, 35(8):102-114, August 1992.
[53] Shankar Ramaswamy and Prithviraj Banerjee. Automatic generation of efficient array redistribution routines for distributed memory multiprocessors. Technical Report UILU-ENG-94-2213, CRHC-94-09, University of Illinois, April 1994.
[54] H. N. Rojas, R. W. Godby, and R. J. Needs. “Space-time method for ab initio calculations of self-energies and dielectric response functions of solids”. Phys. Rev. Lett., 74:1827, 1995.
[55] Loren Schwiebert and D. N. Jayasimha. “Optimal fully adaptive minimal wormhole routing for meshes”. Journal of Parallel and Distributed Computing, 27(1):56-70, May 1995.
[56] Ravi Sethi. “Complete register allocation problems”. SIAM Journal on Computing, 4(3):226-248, September 1975.
[57] Ravi Sethi and J. D. Ullman. “The generation of optimal code for arithmetic expressions”. Journal of the Association for Computing Machinery, 17(4):715-728, October 1970.
[58] Sharad Singhai and Kathryn McKinley. “Loop fusion for data locality and parallelism”. In Mid-Atlantic Student Workshop on Programming Languages and Systems, SUNY at New Paltz, April 1996.
[59] Rajeev Thakur and Alok Choudhary. “All-to-all communication on meshes with wormhole routing”. In International Parallel Processing Symposium, pages 561-565, April 1994.
[60] Rajeev Thakur, Alok Choudhary, and Geoffrey Fox. “Runtime array redistribution in HPF programs”. In Scalable High Performance Computing Conference, pages 309-316, May 1994.
[61] Rajeev Thakur, Alok Choudhary, and J. Ramanujam. “Efficient algorithms for array redistribution”. IEEE Transactions on Parallel and Distributed Systems, 7(6):587-594, June 1996.
[62] Yih-Jia Tsai and Philip K. McKinley. “An extended dominating node approach to collective communication in all-port wormhole-routed 2D meshes”. In Scalable High Performance Computing Conference, pages 199-206, Knoxville, TN, May 1994.
[63] Akiyoshi Wakatani and Michael Wolfe. “A new approach to array redistribution: strip mining redistribution”. In Parallel Architectures and Languages Europe, pages 323-335, July 1994.
[64] Akiyoshi Wakatani and Michael Wolfe. “Optimization of array redistribution for distributed memory multicomputers”. Parallel Computing, 21(9):1485-1490, September 1995.
[65] S. Winograd. Arithmetic Complexity of Computations. Society for Industrial and Applied Mathematics, Philadelphia, 1980.
[66] P. Winter. “Steiner problems in networks: a survey”. Networks, 17:129-167, 1987.
[67] Michael E. Wolf and Monica S. Lam. “A data locality optimization algorithm”. In SIGPLAN ’91 Conference on Programming Language Design and Implementation, pages 30-44, Toronto, Canada, June 1991.
[68] Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[69] Jingling Xue. “Communication-minimal tiling of uniform dependence loops”. In Languages and Compilers for Parallel Computing, pages 330-349, San Jose, August 1996.