PERFORMANCE OPTIMIZATION OF A CLASS OF LOOPS IMPLEMENTING MULTI-DIMENSIONAL INTEGRALS
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Chi Chung Lam, B.S.C.I.S., M.S.
* * * * *
The Ohio State University
1999
Dissertation Committee:
Professor Ponnuswamy Sadayappan, Adviser
Professor Dhabaleswar K. Panda
Professor Rephael Wenger
Professor Gerald Baumgartner
Approved by
Adviser
Department of Computer and Information Science
UMI Number: 9941367
UMI Microform 9941367 Copyright 1999, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized copying under Title 17, United States Code.
UMI, 300 North Zeeb Road, Ann Arbor, MI 48103
ABSTRACT
Multi-dimensional summations, or discretized integrals, involving products of several arrays arise in scientific computing, e.g. in calculations that model electronic properties of semiconductors and metals. This thesis addresses the performance optimization of a class of loops that implement such multi-dimensional summations. The optimization measures considered are arithmetic operation count, communication cost, and memory usage.

The goal of the operation minimization problem is to seek an equivalent sequence of nested loops that computes a given summation using a minimum number of arithmetic operations. The problem is proved to be NP-complete and an efficient pruning search algorithm is developed for finding an optimal solution.

Due to the potentially large sizes of intermediate arrays in the synthesized optimal solution, it is imperative to reduce the memory usage by loop fusion and loop reordering transformations. We analyze the relationship between loop fusion and memory usage and present algorithms for finding loop fusion configurations that minimize memory usage under static and dynamic memory allocation models.

In evaluating the sums in a multi-processor environment, the partitioning of the arrays among processors determines the inter-processor communication overhead. The processors are modeled as a logical multi-dimensional processor grid, with each array to be distributed or replicated along one or more processor dimensions. A dynamic programming algorithm is developed to determine an optimal partitioning of data and operations among processors that minimizes the communication and computational costs. We also describe two approaches for determining the appropriate loop fusions and array distributions that minimize communication cost without exceeding a given memory limit.

After initially developing the solutions to the various optimization problems in the context of dense arrays, we enhance them to address the practically significant issues of sparsity, use of fast Fourier transforms, and utilization of common sub-expressions.
Dedicated to my wife
ACKNOWLEDGMENTS

I wish to thank my advisor, P. Sadayappan, for intellectual support, encouragement, and enthusiasm which made this thesis possible, and for his patience in correcting both my stylistic and scientific errors.

I thank Rephael Wenger, Gerald Baumgartner, and Daniel Cociorva for stimulating discussions on various aspects of this thesis.

This research was supported in part by a grant from the National Science Foundation.
VITA
February 22, 1962 .................... Born - Hong Kong
1995 .................................. B.S. Computer and Information Science, The Ohio State University
1998 .................................. M.S. Computer and Information Science, The Ohio State University
1995-present .......................... Graduate Fellow, The Ohio State University
PUBLICATIONS
Research Publications
Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. "Optimization of memory usage and communication requirements for a class of loops implementing multi-dimensional integrals". In Languages and Compilers for Parallel Computing, San Diego, August 1999.

Chi-Chung Lam, P. Sadayappan, Daniel Cociorva, Mebarek Alouani, and John Wilkins. "Performance optimization of a class of loops involving sums of products of sparse arrays". In Ninth SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, TX, March 1999.

Chi-Chung Lam and Wu-Chi Feng. "Approximating Cumulative Bandwidth Requirements for The Delivery of Stored Video". In Interworking, Ottawa, Canada, July 1998.

Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "On optimizing a class of multi-dimensional loops with reductions for parallel execution". Parallel Processing Letters, 7(2):157-168, 1997.
Chi-Chung Lam, C.-H. Huang, and P. Sadayappan. "Optimal algorithms for all-to-all personalized communication on rings and two-dimensional tori". Journal of Parallel and Distributed Computing, 43:3-13, 1997.

Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimization of a class of multi-dimensional integrals on parallel machines". In Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.

Chi-Chung Lam. "An Efficient Distributed Channel Allocation Algorithm Based on Dynamic Channel Boundaries". In International Conference on Network Protocols, Columbus, Ohio, October 1996.

Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimal reordering and mapping of a class of nested-loops for parallel execution". In Languages and Compilers for Parallel Computing, San Jose, August 1996.
FIELDS OF STUDY
Major Field: Computer and Information Science
Studies in:
High Performance Computing: Prof. P. Sadayappan
Data Mining: Prof. Renée Miller
Networking: Prof. Raj Jain and Prof. Wu-Chi Feng
TABLE OF CONTENTS
Page
Abstract ............................................................ ii
Dedication .......................................................... iv
Acknowledgments ..................................................... v
Vita ................................................................ vi
List of Tables ...................................................... xi
List of Figures ..................................................... xii

Chapters:

1. Introduction ..................................................... 1

   1.1 Motivation ................................................... 2
   1.2 Form of Multi-Dimensional Summations ........................ 4
   1.3 Methodology .................................................. 5
   1.4 An Example ................................................... 8
   1.5 Related Work ................................................. 14
   1.6 Organization of this Dissertation ........................... 18

2. Operation Minimization ........................................... 20

   2.1 Problem Description .......................................... 21
   2.2 Formalization of the Optimization Problem ................... 23
   2.3 Expression Tree Representation ............................... 26
   2.4 NP-Completeness .............................................. 27
   2.5 A Pruning Search Procedure .................................. 32
   2.6 Common Sub-Expressions ....................................... 36
   2.7 Sparse Arrays ................................................ 38
   2.8 Fast Fourier Transform ....................................... 41
   2.9 An Example ................................................... 43

3. Memory Usage Minimization ........................................ 45

   3.1 Introduction ................................................. 46
   3.2 Preliminaries ................................................ 52
   3.3 Static Memory Allocation ..................................... 56
   3.4 Memory-Optimal Evaluation Order of Unfused Expression Trees . 64
       3.4.1 Problem Statement ...................................... 65
       3.4.2 An Efficient Algorithm ................................. 69
       3.4.3 Correctness of the Algorithm ........................... 74
   3.5 Dynamic Memory Allocation .................................... 81
   3.6 Common Sub-Expressions ....................................... 88
   3.7 Sparse Arrays ................................................ 94
   3.8 Fast Fourier Transform ....................................... 97
   3.9 Further Reduction in Memory Usage ........................... 98
   3.10 An Example .................................................. 100

4. Communication Minimization ....................................... 105

   4.1 Preliminaries ................................................ 106
   4.2 Application to Matrix Multiplication ........................ 109
   4.3 A Dynamic Programming Algorithm ............................. 112
   4.4 An Example ................................................... 114
   4.5 Common Sub-Expressions ....................................... 115
   4.6 Sparse Arrays ................................................ 117
   4.7 Fast Fourier Transform ....................................... 118
   4.8 Communication Minimization with Memory Constraint ........... 118
       4.8.1 Preliminaries .......................................... 119
       4.8.2 Two Approaches ......................................... 122

5. Conclusions ...................................................... 127

   5.1 Research Topics for Further Pursuit ......................... 129
       5.1.1 Generalization of The Class of Nested-Loop Computations Handled 129
       5.1.2 Optimization of Cache Performance ...................... 130
       5.1.3 Optimization of Disk Access Cost ....................... 130
       5.1.4 Development of an Automatic Code Generator ............. 132

Bibliography ........................................................ 134
LIST OF TABLES
Table Page
1.1 Variables in an example physics computation. ................... 3

3.1 Solution sets for the subtrees in the example. ................. 61

3.2 Memory usage of three different traversals of the expression tree in Figure 3.8. ................... 67

3.3 The seqsets for the fusion graph in Figure 3.12(a). ............ 87

4.1 Optimal array distributions for an example formula sequence. ... 115
LIST OF FIGURES
Figure Page
1.1 Two ways to compute the same multi-dimensional summation. ..... 9

1.2 Three loop fusion configurations. .............................. 11

2.1 An example expression tree. .................................... 26

2.2 Different expression trees for a multiplication decision problem instance. ... 30

2.3 Expression tree transformations for the first pruning rule. .... 33

2.4 The expression tree transformation for the second pruning rule. ... 35

2.5 Sparsity entries and sparsity graphs of arrays. ................ 40

3.1 An example multi-dimensional summation and two representations of a computation. ... 47
3.2 Three loop fusion configurations for the expression tree in Figure 3.1. 48
3.3 Algorithm for static memory allocation............................................................ 58
3.4 Algorithm for static memory allocation, (cont.) ........................................ 59
3.5 Algorithm for static memory allocation, (cont.) ........................................ 60
3.6 Algorithm for static memory allocation, (cont.) ........................................ 61
3.7 An optimal solution for the example. .......................... 63
3.8 An example unfused expression tree................................................................. 65
3.9 Procedure for finding a memory-optimal traversal of an expression tree. ... 71

3.10 Optimal traversals for the subtrees in the expression tree in Figure 3.8. ... 73

3.11 Memory usage comparison of two traversals in Lemma 3. ........ 78

3.12 A fusion graph with equal fusions and its two evaluation orders. ... 83

3.13 Algorithm for dynamic memory allocation. ..................... 84

3.14 Algorithm for dynamic memory allocation (cont.) .............. 85

3.15 An example multi-dimensional summation with common sub-expressions and representations of a computation. ... 90

3.16 Examples of illegal fusion graphs for a DAG. ................. 91
3.18 An example of legal loop fusions for sparse arrays........................................ 94
3.19 Illegal fusion graphs due to representations of sparse arrays.................... 95
3.20 Fused size of sparse arrays..................................................................................... 96
3.21 Loop fusions for an FFT node.............................................................................. 97
3.22 The DAG representation of an example formula sequence........................ 101
3.23 Optimal loop fusions for the example formula sequence............................. 104
4.1 An example expression tree................................................................................... 106
4.2 Expression tree representation of matrix m ultiplication............................. 109
4.3 The expression tree in Figure 3.1(c). ......................... 120
CHAPTER 1

INTRODUCTION
This thesis addresses the optimization of a class of loop computations that arise in the implementations of multi-dimensional integrals, or summations, of the product of several arrays. Such integral calculations arise, for example, in the computation of electronic properties of semiconductors and metals [6, 24, 54]. The objective is to minimize the execution time of such computations on a parallel computer with memory constraints. In addition to the performance optimization issues pertaining to inter-processor communication, there is the opportunity to apply algebraic transformations using the properties of commutativity, associativity, and distributivity, to minimize the total number of arithmetic operations. Since the input and intermediate arrays are often too large to fit into the available memory, loop fusion is necessary to reduce the array sizes and hence the memory usage.

In the context of the class of loops that compute multi-dimensional summations, we consider three optimization problems:

1. the minimization of arithmetic operations through the application of the algebraic laws of associativity, commutativity, and distributivity;

2. the minimization of memory usage by loop fusions, loop reordering, and changing the order of evaluation; and

3. the minimization of inter-processor communication and computational costs by appropriate partitioning of data and computations on processors.
Section 1.1 gives an example of the computational physics applications that motivate this research. The form of multi-dimensional summations is formalized in Section 1.2. An overview of our approach to solving the optimization problems is provided in Section 1.3 and illustrated by an example in Section 1.4. Section 1.5 provides a discussion of related work. The organization of the rest of this thesis can be found in Section 1.6.
1.1 Motivation

With the increase in power of supercomputers, more scientific computations become feasible with higher accuracies. In one class of scientific computations, the final result to be computed can be expressed as a multi-dimensional integral of the products of several arrays that represent some physical quantities. One computational physics application that motivates this research is the calculation of electronic properties of MX materials with the inclusion of many-body effects [6]. MX materials are linear chain compounds with alternating transition-metal atoms (M = Ni, Pd, or Pt) and halogen atoms (X = Cl, Br, or I). The following multi-dimensional integral computes the susceptibility in momentum space for the determination of the self-energy of MX compounds in real space.
χ_{G,G'}(k, iτ) = Σ_{R''L'', R'L'} D^{k,G}_{R''L'', R'L'} × (D^{k,G'}_{R''L'', R'L'})*

where

D^{k,G}_{R''L'', R'L'} = ∫ dr e^{i(k+G)·r} Σ_{R'''L'''} Φ_{R'L', R'''L'''}(r) × G_{R''L'', R'''L'''}(iτ)

and

Φ_{R'L', R'''L'''}(r) = Φ_{R'L'}(r - R') × Φ_{R'''L'''}(r - R''')

Variable   Interpretation                              Range
RL         Orbital                                     10
r          Discretized points in real space            10^…
τ          Time step                                   10
k          K-point in an irreducible Brillouin zone    10
G          Reciprocal lattice vector                   10^…

Table 1.1: Variables in an example physics computation.
In the above equations, Φ is the localized basis function, G is the orbital projected Green function for electrons and holes, and D is computed using a fast Fourier transform (FFT). The interpretation of the variables and their ranges are given in Table 1.1. After some simplifications and rewriting Φ as a two-dimensional array Y, the integral can be expressed as the following multi-dimensional summation of the product of several arrays. Note that array Y is sparse since Φ is a localized function.

Σ_{r, r1, RL, RL1, RL2, RL3} Y[r,RL] × Y[r,RL2] × Y[r1,RL3] × Y[r1,RL1] × G[RL1,RL,t] × G[RL2,RL3,t] × exp[k,r] × exp[G,r] × exp[k,r1] × exp[G1,r1]
The large sizes of the arrays could potentially lead to very high computational costs and memory usage. For example, computing the discretized integral using the above equations involves 3.54 × 10^15 arithmetic operations and would take more than six days on a 6.4 gigaflop machine. If unoptimized, the size of the intermediate array D would be 10^… elements. Moreover, moving these large arrays among the processors in a message-passing parallel machine could incur a large communication overhead. Therefore, optimization of this kind of computation on parallel computers is important.
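The quoted runtime follows from simple arithmetic; a quick sanity check is shown below. The operation count 3.54 × 10^15 is an assumption here (the exponent is illegible in this copy); it is the value consistent with the quoted 6.4 gigaflop rate and "more than six days" of runtime.

```python
# Back-of-the-envelope check of the quoted runtime for the unoptimized integral.
ops = 3.54e15            # total arithmetic operations (exponent assumed; see lead-in)
rate = 6.4e9             # sustained machine speed: 6.4 gigaflops
seconds = ops / rate     # wall-clock time at that rate
days = seconds / 86400.0 # 86,400 seconds per day
print(round(days, 1))    # about 6.4 days, i.e. "more than six days"
```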
Scientists who need to calculate such multi-dimensional integrals usually attempt to optimize the computation manually. They make decisions on implementation parameters, such as which intermediate results to calculate as stepping-stones to the final discretized integral; which loops to fuse to reduce array sizes; and how the arrays and the computation should be partitioned among the processors. However, the number of possible choices for the implementation parameters is very large and the manual optimizations may not yield an optimal way to compute the discretized integrals. Hence, there is a need to find the optimal solution algorithmically, which is the goal of this thesis.

We expect the results and algorithms presented in this thesis to be applicable not only to physics computations but also to multi-dimensional integral calculations in other scientific areas. We have recently learnt about computational chemistry calculations that share similar forms to the discretized integrals addressed here. Also, the algorithms for minimizing memory usage should be useful in other computer applications such as register allocation or database query optimization.
1.2 Form of Multi-Dimensional Summations

The multi-dimensional summations (i.e., discretized integrals) we consider in this thesis have the following general form.

A_0[I_0] = Σ_{i_1, i_2, ..., i_k} { A_1[I_1] × A_2[I_2] × ... × A_n[I_n] }

where I_j represents the indices of array A_j and is an ordered list of distinct indices from the set {i_1, i_2, ..., i_m}. Since A_0 is the result of the summation, its set of indices I_0 equals {i_{k+1}, ..., i_m}. We assume each index i_j has a constant range of 1 to N_j and the array entries are directly referenced by a list of distinct indices. Thus, for example, we do not consider a product term such as A_1[i_1, i_1] or A_2[i_1 - i_2].
A multi-dimensional summation in the above form can be implemented as a set of perfectly-nested loops of the following form, with a single reduction statement as its loop body.

for i_1 = 1 to N_1
  for i_2 = 1 to N_2
    ...
      for i_m = 1 to N_m
        A_0[I_0] = A_0[I_0] + A_1[I_1] × A_2[I_2] × ... × A_n[I_n]
      endfor
    ...
  endfor
endfor
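The general form above can be written out concretely. The sketch below (with illustrative array names and sizes, for n = 2 inputs, m = 3 indices, and k = 2 summation indices) implements the perfectly-nested loop directly and checks it against an equivalent library contraction:

```python
import numpy as np

# Instance of the general form: A0[i3] = sum over i1, i2 of A1[i1,i2] * A2[i2,i3],
# so the summation indices are {i1, i2} and the result indices are I0 = {i3}.
N1, N2, N3 = 4, 5, 6
rng = np.random.default_rng(0)
A1 = rng.random((N1, N2))
A2 = rng.random((N2, N3))

# Perfectly-nested loops with a single reduction statement as the loop body.
A0 = np.zeros(N3)
for i1 in range(N1):
    for i2 in range(N2):
        for i3 in range(N3):
            A0[i3] += A1[i1, i2] * A2[i2, i3]

# The same summation expressed as a single contraction.
assert np.allclose(A0, np.einsum('ij,jk->k', A1, A2))
```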
In practice, some of the arrays in the product terms occur more than once in the multi-dimensional summation (with each occurrence possibly associated with different index variables) and some of them are sparse. Moreover, some of the product terms are exponential functions and thus permit some products to be computed more efficiently using a fast Fourier transform rather than an explicit matrix-vector product. We also develop an optimization framework that appropriately models common sub-expressions, sparsity, and fast Fourier transform (FFT) operations.
1.3 Methodology

For the class of loops that compute multi-dimensional summations, we are interested in optimizing the following three performance metrics:

• the number of arithmetic operations, which can be reduced by applying algebraic laws to form and reuse intermediate results;

• memory usage, which is affected by loop fusions and the order of evaluation; and

• communication overhead, which depends on the distribution of data and computations among processors.
It would be desirable to consider the optimization of the three performance metrics in an integrated fashion. This integrated optimization problem can be stated as: given a multi-dimensional summation of the product of a list of arrays in the form specified in Section 1.2 and a limit on the amount of available memory, find a parallel program to compute the given summation in minimum time, using no more memory than specified. The form of a parallel program is a (possibly imperfectly-nested) loop structure containing array evaluation statements and array redistribution statements.
However, due to the large number of possible combinations of ways to apply the algebraic laws, to fuse the loops, and to partition the data and operations, the integrated optimization problem is extremely complex. In order to make the overall performance optimization problem more tractable, we view it in terms of three independent optimization problems:

1. Given a specification of the required computation as a multi-dimensional sum of the product of input arrays, determine an equivalent sequence of nested loops that computes the result using a minimum number of arithmetic operations.

2. Given an operation-count-optimal form of the computation (determined by solving the above sub-problem), apply loop transformations such as loop fusion, loop nest reordering, and loop permutation to reduce the memory usage to within the available amount of memory.

3. Given a sequence of loop computations to be performed on each processor (from solution of the two above sub-problems), determine the data distribution of input, intermediate and result arrays, and the mapping of computations among processors to minimize communication cost for load-balanced parallel execution.
Solving the three optimization problems above may not produce an optimal solution to the integrated optimization problem, but the solution produced should be close to optimal in practice.

In minimizing arithmetic operations by applying algebraic laws, we assume that the algebraic laws can be applied freely without jeopardizing numerical stability. If numerical stability becomes a concern in computing some multi-dimensional summations, we can disable the unsafe applications of algebraic laws to certain arrays that are sensitive to evaluation orders.
For each of the optimization problems, we do the following.

• We analyze and characterize the problem in terms of its solution space and the performance metrics.

• We design an efficient algorithm for finding an optimal solution, incorporating pruning rules and/or dynamic programming techniques to reduce the complexity of the algorithm.

• We extend the algorithm to handle sparse arrays, common sub-expressions, and fast Fourier transforms, which are characteristics of many practical scientific computations.

• We apply the algorithm on practical physics computations to show the effectiveness of the algorithm.
1.4 An Example

As an illustration of the three optimization problems described in the previous section, consider the following multi-dimensional summation, where the array elements are floating point numbers.

W[k] = Σ_{i,j,l} A[i,j] × B[j,k,l] × C[k,l]

A naive way to compute W[k] for all values of k is to have a single perfect loop nest such as:

initialize W
for i
  for j
    for k
      for l
        W[k] += A[i,j] × B[j,k,l] × C[k,l]
Computing W this way takes 3 × Ni × Nj × Nk × Nl floating point arithmetic operations. However, we can apply algebraic laws of commutativity, associativity, and distributivity to rearrange the multiplications and reductions in the multi-dimensional summation. Doing so may reduce the number of arithmetic operations since some index variables are missing in some arrays. The goal of the operation minimization problem is to find an operation-minimal rearrangement of the multiplications and reductions for a given multi-dimensional summation.
In our example, since C[k,l] does not depend on i and j, we can move it out of Σ_i and Σ_j and get the equation and the corresponding loop structure as shown in Figure 1.1(a). Here, f1, f2, and f3 are intermediate arrays. By computing the
W[k] = Σ_l ((Σ_i Σ_j A[i,j] × B[j,k,l]) × C[k,l])
     = Σ_j ((Σ_i A[i,j]) × (Σ_l B[j,k,l] × C[k,l]))
(a)
for i
  for j
    for k
      for l
        f1[i,j,k,l] = A[i,j] × B[j,k,l]
initialize f2
for i
  for j
    for k
      for l
        f2[k,l] += f1[i,j,k,l]
for k
  for l
    f3[k,l] = f2[k,l] × C[k,l]
initialize W
for k
  for l
    W[k] += f3[k,l]

(b)
initialize f1
for i
  for j
    f1[j] += A[i,j]
for j
  for k
    for l
      f2[j,k,l] = B[j,k,l] × C[k,l]
initialize f3
for j
  for k
    for l
      f3[j,k] += f2[j,k,l]
for j
  for k
    f4[j,k] = f1[j] × f3[j,k]
initialize W
for j
  for k
    W[k] += f4[j,k]
Figure 1.1: Two ways to compute the same multi-dimensional summation.
intermediate arrays, the number of multiplications invoking C is lowered by a factor of Ni × Nj. This reduces the number of arithmetic operations to 2 × Ni × Nj × Nk × Nl + 2 × Nk × Nl. But this is not the operation-minimal way to compute W. The optimal form is shown in Figure 1.1(b), which takes only 2 × Nj × Nk × Nl + Ni × Nj + 2 × Nj × Nk operations and represents an order of magnitude improvement.
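Both evaluation orders of Figure 1.1 can be verified to compute the same W. A small numerical sketch (the array sizes are illustrative), which also tabulates the operation counts quoted above:

```python
import numpy as np

Ni, Nj, Nk, Nl = 6, 7, 8, 9
rng = np.random.default_rng(1)
A = rng.random((Ni, Nj))
B = rng.random((Nj, Nk, Nl))
C = rng.random((Nk, Nl))

# Naive single loop nest: 3 * Ni*Nj*Nk*Nl operations.
W_naive = np.einsum('ij,jkl,kl->k', A, B, C, optimize=False)

# Operation-minimal form of Figure 1.1(b).
f1 = A.sum(axis=0)           # f1[j]     = sum_i A[i,j]          (Ni*Nj adds)
f2 = B * C[None, :, :]       # f2[j,k,l] = B[j,k,l] * C[k,l]     (Nj*Nk*Nl mults)
f3 = f2.sum(axis=2)          # f3[j,k]   = sum_l f2[j,k,l]       (Nj*Nk*Nl adds)
f4 = f1[:, None] * f3        # f4[j,k]   = f1[j] * f3[j,k]       (Nj*Nk mults)
W = f4.sum(axis=0)           # W[k]      = sum_j f4[j,k]         (Nj*Nk adds)
assert np.allclose(W, W_naive)

naive_ops = 3 * Ni * Nj * Nk * Nl
optimal_ops = 2 * Nj * Nk * Nl + Ni * Nj + 2 * Nj * Nk
print(naive_ops, optimal_ops)  # the optimized form needs far fewer operations
```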
In general, there are many ways to rearrange the multiplications and reductions in a given multi-dimensional summation and they result in different numbers of arithmetic operations. Since finding the operation-minimal rearrangement is not trivial, an automated procedure for determining the optimal solution is needed.
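For contraction orderings in particular, automated search now exists in general-purpose libraries. As an outside illustration only (not the algorithm developed in this thesis), NumPy's einsum_path searches for a cheap parenthesization of the same example:

```python
import numpy as np

Ni, Nj, Nk, Nl = 20, 30, 40, 50
A = np.ones((Ni, Nj))
B = np.ones((Nj, Nk, Nl))
C = np.ones((Nk, Nl))

# Ask NumPy to search for a low-cost contraction order for W[k].
path, report = np.einsum_path('ij,jkl,kl->k', A, B, C, optimize='optimal')
print(report)  # reports naive vs. optimized FLOP counts and the chosen order
```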
Once the operation-minimal rearrangement is determined, the next step is to implement it as some loop structure. A straightforward way is to generate a sequence of perfect loop nests, each evaluating an intermediate array, as in Figure 1.1. However, the intermediate arrays in practical scientific applications could be too large to fit into memory. The reduction of array sizes by reordering and fusing the loops is desirable.
Consider the operation-minimal form and the corresponding loop structure shown in Figure 1.1(b). For now, we assume the input arrays can be generated one element at a time (by the genv function for array v). Figure 1.2(a) shows the loop structure that includes the loop nests for generating the input arrays. Note that no loop fusion has occurred. Under static memory allocation, the total memory usage of the loop structure is the total size of the arrays, which is 2 × Nj × Nk × Nl + 2 × Nj × Nk + Ni × Nj + Nk × Nl + Nj + Nk.
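That total is just the sum of the individual array extents in Figure 1.2(a). A quick symbolic check (the index ranges are placeholders):

```python
# Static memory usage of the unfused loop structure: one full-size buffer per array.
Ni, Nj, Nk, Nl = 6, 7, 8, 9
sizes = {
    'A': Ni * Nj,   'B': Nj * Nk * Nl,  'C': Nk * Nl,
    'f1': Nj,       'f2': Nj * Nk * Nl, 'f3': Nj * Nk,
    'f4': Nj * Nk,  'f5': Nk,
}
total = sum(sizes.values())
# Agrees with 2*Nj*Nk*Nl + 2*Nj*Nk + Ni*Nj + Nk*Nl + Nj + Nk from the text.
assert total == 2*Nj*Nk*Nl + 2*Nj*Nk + Ni*Nj + Nk*Nl + Nj + Nk
print(total)
```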
Suppose we fuse all the loops around the evaluations of A and f1, all the loops
around those of B, C, f2, and f3, and also all the loops around those of f4 and
(a)
for i for j [ A[i,j]=genA(i,j)
for j for k for l [ B[j,k,l]=genB(j,k,l)
for k for l [ C[k,l]=genC(k,l)
initialize f1
for i for j [ f1[j]+=A[i,j]
for j for k for l [ f2[j,k,l]=B[j,k,l]xC[k,l]
initialize f3
for j for k for l [ f3[j,k]+=f2[j,k,l]
for j for k [ f4[j,k]=f1[j]xf3[j,k]
initialize f5
for j for k [ f5[k]+=f4[j,k]

(b)
initialize f1
for i for j [ A=genA(i,j)  f1[j]+=A
initialize f3
for k for l [ C=genC(k,l)
  for j [ B=genB(j,k,l)  f2=BxC  f3[j,k]+=f2
initialize f5
for j for k [ f4=f1[j]xf3[j,k]  f5[k]+=f4

(c)
for k for l [ C[k,l]=genC(k,l)
initialize f5
for j [ for i [ A[i]=genA(i,j)
  initialize f1
  for i [ f1+=A[i]
  for k [ initialize f3
    for l [ B=genB(j,k,l)  f2=BxC[k,l]  f3+=f2
    f4=f1xf3  f5[k]+=f4

Figure 1.2: Three loop fusion configurations.
f5 (after some suitable loop reordering). The resulting loop structure is shown in
Figure 1.2(b). Once a loop (say, a t-loop) around the evaluation of an array v and the
consumption of the same array is fused, the t-dimension of the array v is no longer
necessary and can be eliminated. This reduces the size of the array v by a factor of
Nt. If all the loops around the evaluation and the consumption of the same array are
fused, the array can be replaced by a scalar variable. Thus, A, B, C, f2, and f4 are
reduced to scalars.
Figure 1.2(c) shows another possible way to fuse the loops. Here, we first fuse all
the j-loops and then fuse all the k-loops and l-loops inside them. Doing so reduces
the sizes of all arrays except C and f5. By fusing the j-, k-, and l-loops around the
evaluation and the consumption of those arrays, the j-, k-, and l-dimensions of those
arrays are eliminated. Hence, B, f1, f2, f3, and f4 are reduced to scalars while A
becomes a one-dimensional array.
However, in many cases, we cannot fuse all the loops and reduce all arrays to
scalars. Instead, loop fusions could be mutually exclusive; fusing one loop may prevent
the fusion of another. For example, in Figure 1.2(b), with appropriate loop reordering,
we can further fuse either the j-loops around the evaluation and the consumption of
f1 or the k-loops around the evaluation and the consumption of f3, but not both.
Determining a set of loop fusions and loop reorderings that minimizes overall memory
usage is the objective of the memory usage minimization problem.
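The effect of fusion on intermediate storage can be illustrated with a minimal Python sketch (hypothetical sizes and generator formulas; only the l-loop fusion around f2 from Figure 1.2(b) is modeled):

```python
Nj, Nk, Nl = 3, 4, 5
# Deterministic stand-ins for genB and genC, chosen only for illustration.
B = [[[(j + 2*k + 3*l) * 0.5 for l in range(Nl)] for k in range(Nk)] for j in range(Nj)]
C = [[(k + l) * 0.25 for l in range(Nl)] for k in range(Nk)]

# Unfused: the full intermediate f2[j][k][l] (Nj*Nk*Nl elements) is materialized.
f2 = [[[B[j][k][l] * C[k][l] for l in range(Nl)] for k in range(Nk)] for j in range(Nj)]
f3_unfused = [[sum(f2[j][k]) for k in range(Nk)] for j in range(Nj)]

# Fused: the l-loop around the evaluation and the consumption of f2 is fused,
# so the l-dimension of f2 is eliminated and f2 collapses to a scalar.
f3_fused = [[0.0] * Nk for _ in range(Nj)]
for j in range(Nj):
    for k in range(Nk):
        acc = 0.0                      # f2 reduced to a scalar accumulator
        for l in range(Nl):
            acc += B[j][k][l] * C[k][l]
        f3_fused[j][k] = acc

assert f3_fused == f3_unfused
```

Both versions add the same values in the same order, so the results are identical; only the Nj x Nk x Nl intermediate array disappears.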
In implementing a nested loop structure on a message-passing parallel computer,
the partitioning of the arrays and computation among the processors has to be
decided. Assume that no loop fusion has taken place, so that arrays are in their full
sizes. We take a logical multi-dimensional view of the processors so that each array can be
distributed or replicated along one or more of the processor dimensions. Suppose
the processors are viewed as a two-dimensional array. In the running example, one
of many possible choices is to distribute the j-dimension of array B (referenced by
B[j,k,l]) along the first processor dimension and the k-dimension along the second
processor dimension. Similarly, we can distribute the k-dimension of array C (referenced
by C[k,l]) along the first processor dimension and the l-dimension along the
second processor dimension. Before evaluating f2, the operand arrays B and C need to
be aligned so that operand elements of B and C are on the same processor as the local
portion of f2 to be formed. One way to align B and C is to redistribute C so that its
k-dimension is distributed along the second processor dimension and its distributed
portions are replicated along the first processor dimension. Then, multiplications can
take place on the processors and the product array f2 will have the same distribution
as B. Alternatively, we can redistribute B so that it has the same distribution as C,
and the product array will have the same distribution as C. The communication
cost incurred in redistributing C under the first alternative may be different from the
cost of redistributing B under the second alternative. Moreover, depending on how
f2 is used in evaluating f3, the resulting distribution of f2 with one alternative may
be better than with the other.
Given a nested loop structure, there are a large number of ways to partition
the arrays and computation among the processors. They could vary significantly in
terms of communication and computational cost. The goal of the communication
minimization problem is to find a partitioning for the arrays and computation that
minimizes the total communication and computational cost.
1.5 Related Work
Reduction of arithmetic operations has traditionally been done by compilers using
the technique of common sub-expression elimination [17]. Potkonjak et al. [50]
considered the multiple constant multiplication problem, in which a large number
of multiplications of one variable with several constants need to be performed. The
number of shifts is first minimized, and then the number of additions is minimized
using an iterative pairwise matching heuristic for common sub-expression elimination.
Their problem formulation can be applied to other high-level tasks such as linear
transforms and single and multiple polynomial evaluations. Here, we are minimizing
the total number of multiplications and additions.
Lu and Chew [43] presented a fast algorithm to solve for the scattered field of
a two-dimensional, dielectric-coated conducting cylinder using a hybrid of the combined
field surface integral equation and the volume integral equation. The algorithm relies on
the translation of scattering centers to speed up the matrix-vector multiplication in
the conjugate gradient method. An efficient approach is used to reduce the floating
point operations from O(N^2) to O(N^1.5).
Winograd [65] addressed the general problem of evaluating multiple expressions
that share common variables using the minimum number of arithmetic operations.
This problem was motivated by Strassen's algorithm for matrix multiplication.
However, we do not consider Strassen's algorithm as an array expression evaluation
method that can be synthesized automatically.
Miller [48] suggested several analytical and numerical techniques for reducing the
operation count in computational electromagnetic applications. These techniques
include reducing the spatial-sample count, filling the impedance matrix using model-based
parameter estimation, using fast Fourier transforms in iterative solutions, applying
near-neighbor approximations, reducing the number of near-neighbor coefficients, and
so on. These and similar techniques are useful in reducing the complexity in modeling
physical systems. Some of them have already been used in forming the multi-dimensional
summation that models electronic structures in Section 1.1.
Loop transformations that improve locality and parallelism have been studied
extensively in recent years. For instance, Wolf and Lam [67] formulated cache reuse
and locality mathematically and presented a loop transformation theory that unifies
various transforms as unimodular matrix transformations. Based on their formulation,
they proposed an algorithm for improving the locality of a loop nest by loop
interchange, reversal, skewing, and tiling. Kennedy and McKinley [28] explored the
tradeoff between data locality and parallelism. They presented a memory model to
determine cache line reuse from both multiple accesses to the same memory location
and from consecutive memory accesses. However, we are unaware of any work on loop
transformation based on the distributive law as a means of minimizing arithmetic
operations.
Much work has been done on improving locality and parallelism by loop fusion.
Kennedy and McKinley [29] presented a new algorithm for fusing a collection of loops
to minimize parallel loop synchronization and maximize parallelism. They proved
that finding loop fusions that maximize locality is NP-hard. Two polynomial-time
algorithms for improving locality were given.
Singhai and McKinley [58] examined the effects of loop fusion on data locality
and parallelism together. They viewed the optimization problem as a problem of
partitioning a weighted directed acyclic graph, in which the nodes represent loops
and the weights on edges represent the amount of locality and parallelism. Although the
problem is NP-hard, they were able to find optimal solutions in restricted cases and
heuristic solutions for the general case.
However, the work in this dissertation considers a different use of loop fusion,
which is to reduce the array sizes and memory usage of automatically synthesized code
containing nested loop structures. Traditional compiler research does not address
this use of loop fusion because this problem does not arise with manually-produced
programs.
Gao et al. [18] studied the contraction of arrays into scalars through loop fusion
as a means to reduce the overhead of array accesses. They partitioned a collection
of loop nests into fusible clusters using a max-flow min-cut algorithm, taking into
account the data dependence constraints. However, their study is motivated by data
locality enhancement and not memory reduction. Also, they only considered fusions
of conformable loop nests, that is, loop nests that contain exactly the same set of
loops.
Loop fusion in the context of delayed evaluation of array expressions in compiling
APL programs has been discussed by Guibas and Wyatt in [22]. They presented an
algorithm for incorporating the selector operators into the accessors for the leaf nodes
of a given expression tree. As part of the algorithm, a general buffering mechanism is
devised to save portions of a sub-expression that will be repeatedly needed, to avoid
future recomputation. They considered loop fusion without any loop reordering, and
their approach is also not aimed at minimizing array sizes. We are unaware of any work
on fusion and reordering of multi-dimensional loop nests into imperfectly-nested loops
as a means to reduce memory usage.
A simpler problem related to the memory usage minimization problem is the
register allocation problem, in which the sizes of all nodes are unity and loop fusion is
not considered. The goal is to find an evaluation order of a given binary expression
tree that minimizes the number of registers needed. It has been addressed in [49, 57]
and can be solved in O(n) time, where n is the number of nodes in the expression
tree. For each parent node in the expression tree, the strategy is to evaluate the
child subtree that uses more registers before evaluating the other child subtree. But
if the expression tree is replaced by a directed acyclic graph (in which all nodes are
still of unit size), the problem becomes NP-complete [56]. The algorithm in [57] for
expression trees of unit-sized nodes does not extend directly to expression trees having
nodes of different sizes. Appel and Supowit [5] generalized the register allocation
problem to higher-degree expression trees of arbitrarily-sized nodes. However, the
problem they addressed is slightly different from ours in that, in their problem, space
for a node is not allocated during its evaluation. Also, they restricted their attention
to solutions that evaluate subtrees contiguously, which is sub-optimal in some cases.
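The evaluate-larger-child-first strategy for unit-sized expression trees is the classical Sethi-Ullman numbering; a minimal sketch follows (the Node class and the example trees are illustrative, not from the cited works):

```python
class Node:
    """Binary expression tree node; a node with no children is a leaf."""
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def registers_needed(node):
    # A leaf needs one register to hold its value.
    if node.left is None and node.right is None:
        return 1
    l = registers_needed(node.left)
    r = registers_needed(node.right)
    # Unequal demands: evaluate the costlier child first; its result occupies
    # one register while the cheaper child uses strictly fewer than max(l, r).
    # Equal demands: one extra register is unavoidable.
    return max(l, r) if l != r else l + 1

# A balanced tree over four leaves needs 3 registers;
# a left-deep chain over the same four leaves needs only 2.
balanced = Node(Node(Node(), Node()), Node(Node(), Node()))
chain = Node(Node(Node(Node(), Node()), Node()), Node())
assert registers_needed(balanced) == 3
assert registers_needed(chain) == 2
```

This is the unit-size special case; as the text notes, the labeling does not extend directly to nodes of different sizes.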
Several researchers have investigated issues pertaining to automatic mapping of
data and computation onto parallel machines. Chatterjee et al. [10, 11] considered
the optimal alignment of arrays in evaluating array expressions using data-parallel
languages such as Fortran 90 on massively parallel machines. Alignment functions
are decomposed into axis, stride, and offset components. They presented dynamic
programming algorithms that solve the optimal alignment problem for several
communication cost metrics: multi-dimensional grids and rings, hypercubes, fat-trees,
and the discrete metric. However, they do not consider distribution and replication of
arrays.
Anderson and Lam [4] developed global loop transformations under a linear
algebraic framework that handles very general loop forms. They described a compiler
algorithm that automatically finds computation and data decompositions that
optimize both parallelism and locality. The algorithm models both distributed and
shared address space machines. But it only handles dense matrices where the array
subscripts are affine functions of the loop indices.
Gupta and Banerjee [23] have developed a constraint-based heuristic strategy
for automatic data distribution in the PARADIGM compiler. The compiler makes
data partitioning decisions for Fortran 77 procedures. They decomposed the data
partitioning problem into a number of sub-problems, each dealing with a different
distribution parameter for all the arrays, and presented algorithms that determine
those parameters.
The communication minimization problem we consider differs from these related
works in that we address a more restricted form of the data/computation mapping
problem, but evaluate many more data/computation mapping possibilities, including
data/computation replication and/or distribution on a subset of processors.
1.6 Organization of this Dissertation
The rest of this dissertation deals with the three optimization problems, namely
operation minimization, communication minimization, and memory usage minimization.
Each chapter addresses an optimization problem by first considering the case
where all arrays are dense and the computation can be modeled as an expression tree
with sum and multiply operators. The solutions are then extended to include sparse
arrays, use of fast Fourier transforms, and the utilization of common sub-expressions
(which results in directed acyclic graphs instead of expression trees).
Chapter 2 describes the operation minimization problem and shows by an example
how the implemented pruning search algorithm finds solutions that are better than the
best manually-optimized solutions. The operation minimization algorithm has been
implemented and has been used to obtain significant improvements in the number of
operations for self-energy electronic structure calculations in a tight-binding scheme.
In Chapter 3, we study the use of loop fusion to reduce intermediate array sizes and
present algorithms for finding the optimal loop fusion configuration that minimizes
memory usage. Both static and dynamic memory allocation models are considered.
Chapter 4 considers the optimal partitioning of data and loops to minimize
communication and computational costs for execution on parallel machines. Two
approaches to minimizing communication and computational cost on parallel computers
under a memory constraint are also described.
Chapter 5 provides conclusions and describes some research topics that may be
pursued in the context of optimizing multi-dimensional summation calculations.
CHAPTER 2
OPERATION MINIMIZATION
In the class of computations considered, the final result to be computed can be
expressed as multi-dimensional summations of the product of many input arrays. Due
to commutativity, associativity, and distributivity, there are many different ways to
obtain the same final result, and they could differ widely in the number of arithmetic
operations required. The problem of finding an equivalent form that computes the
result with the least number of operations is not trivial, and so a software tool for
doing this is desirable.
This chapter is organized as follows. Section 2.1 explains this operation
minimization problem with an example. Section 2.2 formalizes the operation
minimization problem in terms of a sequence of formulae that compute the same result as
the original multi-dimensional summation. Section 2.3 describes an expression tree
representation of a formula sequence. Section 2.4 proves that the problem of operation
minimization is NP-complete. A pruning search procedure for finding a formula
sequence that minimizes the number of operations is developed in Section 2.5. The
extensions of the search procedure to handle common sub-expressions, sparse arrays,
and FFT are described in Sections 2.6, 2.7 and 2.8, respectively. An example of its
application is given in Section 2.9.
2.1 Problem Description
Consider, for example, the following multi-dimensional summation
S[t] = Σ_i Σ_j Σ_k A[i,j,t] x B[j,k,t]

A direct way to implement the summation would be:

initialize S
for i for j for k for t [ S[t]+=A[i,j,t]xB[j,k,t]
Execution of this program fragment involves Ni x Nj x Nk x Nt floating point
multiplications and an equal number of floating point additions, resulting in a total
of 2 x Ni x Nj x Nk x Nt operations. If the above loop were input to an optimizing
compiler, it would perform dependence analysis [68] on the loop and determine that
the innermost t-loop was an independent loop and that the other three loops involved
dependences due to reduction operations. Although the loop could be parallelized, no
attempt would be made by the compiler to reduce the number of arithmetic operations
involved. As shown below, a considerable saving in the number of operations is in fact
possible for this computation through application of algebraic properties of addition
and multiplication.
Since addition and multiplication can both be considered associative and
commutative (although floating-point operations are not strictly associative,
vectorizing/parallelizing compilers generally treat these operations as acceptably associative),
and multiplication distributes over addition, we have:
1. Σ_i Σ_j X = Σ_j Σ_i X

2. If term X does not depend on i, then Σ_i (X x Y) = X x Σ_i Y
The first rule allows us to reorder the positions of any number of consecutive
summations, while the second rule permits the extraction of an expression independent of
a summation index out of that summation. By application of the algebraic properties,
we can rewrite the summation as:

S[t] = Σ_j ((Σ_i A[i,j,t]) x (Σ_k B[j,k,t]))

which can be transformed into the following program fragment:

initialize Temp1
for i for j for t [ Temp1[j,t]+=A[i,j,t]
initialize Temp2
for j for k for t [ Temp2[j,t]+=B[j,k,t]
initialize S
for j for t [ S[t]+=Temp1[j,t]xTemp2[j,t]
The new program fragment requires only Nj x Nt floating point multiplications
and Ni x Nj x Nt + Nj x Nk x Nt + Nj x Nt floating point additions. The total number
of floating point operations, which is Ni x Nj x Nt + Nj x Nk x Nt + 2 x Nj x Nt,
is an order of magnitude less than that of the original program fragment. Although
Temp1 and Temp2 are two-dimensional arrays, they can be reduced to scalar variables
by doing loop fusions. The optimization of memory usage via loop transformations
will be addressed in Chapter 3.
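The saving can be verified numerically; the sketch below (hypothetical small sizes, random inputs) evaluates both program fragments in Python and checks that they agree:

```python
import random

Ni, Nj, Nk, Nt = 3, 4, 5, 6
A = [[[random.random() for _ in range(Nt)] for _ in range(Nj)] for _ in range(Ni)]
B = [[[random.random() for _ in range(Nt)] for _ in range(Nk)] for _ in range(Nj)]

# Original fragment: 2 * Ni*Nj*Nk*Nt operations.
S_direct = [sum(A[i][j][t] * B[j][k][t]
                for i in range(Ni) for j in range(Nj) for k in range(Nk))
            for t in range(Nt)]

# Rewritten fragment: factor the sums over i and k, then one multiply per (j, t).
Temp1 = [[sum(A[i][j][t] for i in range(Ni)) for t in range(Nt)] for j in range(Nj)]
Temp2 = [[sum(B[j][k][t] for k in range(Nk)) for t in range(Nt)] for j in range(Nj)]
S_opt = [sum(Temp1[j][t] * Temp2[j][t] for j in range(Nj)) for t in range(Nt)]

assert all(abs(a - b) < 1e-9 for a, b in zip(S_direct, S_opt))
```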
The above example is simple enough to be able to manually seek the optimal
index reordering and application of the distributive law to minimize the number of
operations. However, the complex sequences of such summations that arise in some
computational physics applications are not easily hand-optimized. Thus, a software
tool for doing this is desirable.
2.2 Formalization of the Optimization Problem
Generalizing from the example of the previous section, the problem addressed is
the minimization of the arithmetic operations of a given multi-dimensional summation.
We are interested in deriving an automatable strategy for operation reordering and
application of the distributive law to reduce the amount of arithmetic required.
However, automatic generation of transformations such as those needed to transform the
standard matrix multiplication algorithm into Strassen's algorithm is clearly beyond
the scope of what we believe is feasible.
Hence, we first have to define more precisely the space of equivalent programs that
are to be searched amongst. We formalize this space as a set of formula sequences.
Each formula in a formula sequence is either:

• a multiplication formula of the form: fr[...] = X[...] x Y[...], or

• a summation formula of the form: fr[...] = Σ_i X[...]

where X (and Y) is either a product term in the given multi-dimensional summation
or a previously defined function fs.

Let X.dimens denote the set of indices in X[...]. For a formula to be well-formed,
every index in X and Y, except the summation index in the second form,
must appear inside fr[...]. Thus, fr.dimens = X.dimens ∪ Y.dimens for any
multiplication formula, and fr.dimens = X.dimens − {i} for any summation formula
with summation index i. Each formula in a sequence computes a partial result of the
summation, and the last formula produces the final result desired. Such a formula
sequence fully specifies the multiplications and additions to be performed in
computing the result, and it is straightforward to generate loop code corresponding to a
particular formula sequence.
For example, the summation S[t] = Σ_i Σ_j Σ_k (A[i,j,t] x B[j,k,t]) can be
represented by the formula sequence below:

f1[i,j,k,t] = A[i,j,t] x B[j,k,t]
f2[i,j,t] = Σ_k f1[i,j,k,t]
f3[j,t] = Σ_i f2[i,j,t]
S[t] = Σ_j f3[j,t]

whereas the optimized form S[t] = Σ_j ((Σ_i A[i,j,t]) x (Σ_k B[j,k,t])) corresponds to
the sequence:

f1[j,t] = Σ_i A[i,j,t]
f2[j,t] = Σ_k B[j,k,t]
f3[j,t] = f1[j,t] x f2[j,t]
S[t] = Σ_j f3[j,t]
The cost, or total number of floating point operations, of a formula sequence is the
sum of the costs of the individual formulae in the sequence, which can be obtained
as below:

• For a multiplication formula of the form fr[...] = X[...] x Y[...], the cost
is Π_{h ∈ X.dimens ∪ Y.dimens} N_h. For instance, the formula f1[i,j,k,t] = A[i,j,t] x
B[j,k,t] has a cost of Ni x Nj x Nk x Nt.

• For a summation formula of the form fr[...] = Σ_i X[...], the cost is (Ni −
1) x Π_{h ∈ X.dimens − {i}} N_h. The term Ni − 1 comes in because adding Ni numbers
requires Ni − 1 additions. But for the sake of simplicity, we may approximate
this cost as Π_{h ∈ X.dimens} N_h. For example, the cost of the formula f2[i,j,t] = Σ_k f1[i,j,k,t]
is (Nk − 1) x Ni x Nj x Nt, or simply Ni x Nj x Nk x Nt.
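The two cost rules can be written directly as functions over index sets; a minimal sketch (the function names and the sizes dictionary are illustrative):

```python
def mult_cost(x_dimens, y_dimens, sizes):
    """Cost of fr[...] = X[...] * Y[...]: product of N_h over X.dimens | Y.dimens."""
    cost = 1
    for h in x_dimens | y_dimens:
        cost *= sizes[h]
    return cost

def sum_cost(x_dimens, i, sizes, approximate=True):
    """Cost of fr[...] = sum_i X[...]; exact form uses (N_i - 1) additions."""
    cost = 1
    for h in x_dimens - {i}:
        cost *= sizes[h]
    return cost * (sizes[i] if approximate else sizes[i] - 1)

sizes = {'i': 10, 'j': 10, 'k': 20, 't': 30}
# f1[i,j,k,t] = A[i,j,t] * B[j,k,t] costs Ni*Nj*Nk*Nt.
assert mult_cost({'i', 'j', 't'}, {'j', 'k', 't'}, sizes) == 10 * 10 * 20 * 30
# f2[i,j,t] = sum_k f1[i,j,k,t] costs (Nk - 1)*Ni*Nj*Nt exactly.
assert sum_cost({'i', 'j', 'k', 't'}, 'k', sizes, approximate=False) == 19 * 10 * 10 * 30
```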
A well-formed formula sequence computes a given multi-dimensional summation
if it satisfies the following conditions:

• Each product term in the given summation appears exactly once among the
sequence of formulae.

• Each summation index in the given summation is summed over in a single
formula.

• No summation index appears in any formula subsequent to the summation over
that index.

• Each defined function fr[...], except the last one, appears exactly once among
the formulae subsequent to its definition.

Hence, a valid sequence must contain exactly n − 1 multiplication formulae and k
summation formulae, where n is the number of product terms in the given summation
and k is the number of summation indices in the given summation. With the
above formalization, the operation minimization problem can be restated as finding
[Tree diagram: root S = Σ_j; its child f3 = x; the children of f3 are f1 = Σ_i over leaf A[i,j,t] and f2 = Σ_k over leaf B[j,k,t].]

Figure 2.1: An example expression tree.
a formula sequence that computes a given multi-dimensional summation and incurs
the least cost, i.e., requires a minimum number of floating point operations.

2.3 Expression Tree Representation

A formula sequence can be represented graphically as an expression tree to show
the hierarchical structure of the computation more clearly. As an example, the
expression tree that represents the above optimized formula sequence is shown in Figure 2.1.
In an expression tree, the leaves are the product terms in the summation and the
internal nodes are the fr[...] defined by the formulae, with the last formula at the
root. An internal node may either be a multiplication node or a summation node.
A multiplication node corresponds to a multiplication formula and has two children,
which are the terms being multiplied together. A summation node corresponds to a
summation formula and has only one child, representing the term on which summation
is performed.
2.4 NP-Completeness

In this section we prove that the decision version of the operation minimization
problem formalized in the previous section is NP-complete. To show this, we identify
a simpler sub-problem: the multiplication sub-problem. Proving the decision version
of this sub-problem to be NP-complete will prove the decision version of the operation
minimization problem to be NP-complete.

The multiplication sub-problem is one where no summation indices are present at
all. Given a set of array variables, they are to be multiplied together in some order so
as to minimize the total multiplication cost. Note that multiplication in this context is
commutative and the variables can be rearranged and grouped in any order. Thus the
polynomial-time dynamic programming algorithm for the matrix-chain multiplication
problem does not generalize to our problem.
To illustrate the multiplication sub-problem, let us consider the example

A[i] x B[j] x C[i,k] x D[j,k]

where Ni = Nj = 10 and Nk = 20. One way to perform the multiplications is

((A[i] x B[j]) x C[i,k]) x D[j,k]

which requires Ni x Nj + 2 x Ni x Nj x Nk = 4100 arithmetic operations. However,
this is not optimal; the optimal order of multiplication is

(A[i] x C[i,k]) x (B[j] x D[j,k])

which requires only Ni x Nk + Nj x Nk + Ni x Nj x Nk = 2400 arithmetic operations.
Note that the cost of each node in the expression tree representing the order of
multiplication is equal to the product of the sizes of the indices of the array represented
at that node. The multiplication cost of the root in the expression tree is fixed and
independent of the order of multiplication.
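For small instances, the multiplication sub-problem can be solved by brute force over all binary trees; the sketch below recovers the optimal cost of 2400 for the example above (representing each term as a frozenset of index names is an illustrative choice, not the dissertation's implementation):

```python
from functools import reduce
from itertools import combinations

sizes = {'i': 10, 'j': 10, 'k': 20}

def node_cost(index_set):
    # Cost of an internal node: product of the sizes of its indices.
    return reduce(lambda a, b: a * b, (sizes[x] for x in index_set), 1)

def best_cost(terms):
    """Minimum total multiplication cost over all binary trees for `terms`."""
    if len(terms) == 1:
        return 0
    best = None
    for pair in combinations(range(len(terms)), 2):
        a, b = terms[pair[0]], terms[pair[1]]
        rest = [t for n, t in enumerate(terms) if n not in pair]
        c = node_cost(a | b) + best_cost(rest + [a | b])
        best = c if best is None else min(best, c)
    return best

# A[i], B[j], C[i,k], D[j,k] as index sets.
terms = [frozenset('i'), frozenset('j'), frozenset({'i', 'k'}), frozenset({'j', 'k'})]
assert best_cost(terms) == 2400   # (A[i] x C[i,k]) x (B[j] x D[j,k])
```

The search re-examines equivalent groupings many times; it is meant only to confirm the example, not to be efficient.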
The multiplication sub-problem can be formally stated as follows. Given a finite
set I, a size Ni ∈ Z+ for each i ∈ I, and a family F of subsets of I such that
I = ∪_{S ∈ F} S, find a binary tree T whose leaf nodes are the sets in F and that minimizes
Σ_{v ∈ T − F} c(v), where T − F is the set of internal nodes (including the root node) in
T, c(v) = Π_{i ∈ I(v)} Ni is the cost of node v, I(v) = ∪_{S ∈ D(v)} S, and D(v) is the set
of leaf nodes in the subtree rooted at v. The corresponding multiplication decision
problem asks whether there is a solution T that costs no more than a given integer
K. The decision problem is in NP because an expression tree can be generated non-
deterministically, and it can be checked in polynomial time whether it costs no more
than K.
The operation minimization problem can be stated as follows. Given a multi-
dimensional summation in the form described in Section 1.2, find an expression tree
that specifies the evaluation order of the multiplications and summations and involves
the minimum number of operations. The corresponding decision problem asks whether there is
an expression tree that evaluates a given multi-dimensional summation and costs no
more than a given integer K.
We prove the NP-completeness of the multiplication decision problem in two steps.
First, we reduce a known NP-complete problem, the Subset Product problem, to
a new problem that we call the Product Partition problem. Next, we reduce the
Product Partition problem to the multiplication decision problem. This proves the NP-
completeness of the multiplication decision problem as well as the NP-completeness
of the decision version of the operation minimization problem.
The Subset Product problem can be defined as follows. Given a finite set A, a
size s(a) ∈ Z+ for each a ∈ A, and a positive integer y, determine whether there
exists a subset A' ⊆ A such that Π_{a ∈ A'} s(a) = y. This problem is known to be NP-
complete [20]. The Product Partition problem is similar. Given a finite set B and
a size s'(b) ∈ Z+ for each b ∈ B, the Product Partition problem asks whether there
exists a subset B' ⊆ B such that Π_{b ∈ B'} s'(b) = Π_{b ∈ B − B'} s'(b).

Let x = Π_{a ∈ A} s(a), where (A, s, y) is an instance of the Subset Product problem.
Note that if 2x/y is not an integer, then there is no solution to this instance. Otherwise,
reduce the Subset Product problem to the Product Partition problem by adding
two new elements of sizes 2x/y and 2y to the set A. Formally, for each Subset Product
problem instance (A, s, y), form a Product Partition problem instance (B, s') where
B = A ∪ {b', b''}, s'(a) = s(a) for all a ∈ A, b' ∉ A, b'' ∉ A, s'(b') = 2x/y, s'(b'') = 2y,
and x = Π_{a ∈ A} s(a). If the Subset Product problem instance (A, s, y) has a solution
A', then B' = A' ∪ {b'} is a solution to the Product Partition problem instance (B, s')
since Π_{b ∈ B'} s'(b) = Π_{b ∈ B − B'} s'(b) = 2x. Conversely, if the Product Partition problem
instance (B, s') has a solution B', then exactly one of b' and b'' must
belong to B' because s'(b') x s'(b'') = 4x is greater than Π_{b ∈ B'} s'(b) = Π_{b ∈ B − B'} s'(b) = 2x. In
this way, either A' = B − B' − {b'} or A' = B' − {b'} would be a solution to the Subset
Product problem instance (A, s, y). Since this reduction can be performed in polynomial time,
it follows that the Product Partition problem is NP-complete.
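The reduction itself is only a few lines; a sketch (the function name and the plain-list representation of sizes are illustrative):

```python
from math import prod

def subset_product_to_product_partition(A_sizes, y):
    """Map a Subset Product instance (sizes s(a), target y) to a Product
    Partition instance by appending the two new elements 2x/y and 2y."""
    x = prod(A_sizes)
    if (2 * x) % y != 0:
        return None                    # the original instance has no solution
    return A_sizes + [2 * x // y, 2 * y]

# Instance: sizes {2, 3, 6}, target y = 6 (A' = {6} or A' = {2, 3} works).
inst = subset_product_to_product_partition([2, 3, 6], 6)
x = 2 * 3 * 6
assert inst == [2, 3, 6, 12, 12]
# B' = {6, b'} has product 6 * (2x/y) = 72 = 2x, matching its complement {2, 3, b''}.
assert 6 * inst[3] == 2 * x == prod(inst) // (2 * x)
```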
[Tree diagrams, root cost L²x in each case: (a) the two second-level nodes each have cost L√x; (b) the two second-level nodes have unequal costs Ly and Lx/y; (c) Mb' and Mb'' are not split between the two second-level subtrees.]

Figure 2.2: Different expression trees for a multiplication decision problem instance.
Given a Product Partition problem instance (B, s'), we construct a multiplication
decision problem instance as follows. Let x = Π_{b ∈ B} s'(b) and note that if √x is
not an integer, then the instance has no solution. Otherwise, reduce the Product
Partition problem to the multiplication decision problem as follows. First, for each
b ∈ B, form a one-dimensional array Mb[ib] where N_ib = s'(b). Then, add two more
arrays Mb'[ib'] and Mb''[ib''] where N_ib' = N_ib'' = L = (n + 1)√x, n = |B|, and
x = Π_{b ∈ B} s'(b). This reduction can be done in polynomial time. Arrays Mb' and Mb''
are so large that their relative position in the expression tree becomes very significant.
These n + 2 arrays, namely Mb', Mb'' and Mb for each b ∈ B, and a maximum cost
K = L²x + 2L√x + (n − 2)√x define the multiplication decision problem instance.
We shall prove that the Product Partition problem instance (B, s') has a solution if
and only if the multiplication decision problem instance has a solution.

We use two facts about the multiplication decision problem. First, the cost of the
root node is the same in all solutions and is the product of the sizes of all indices,
in this case L²x. Second, the cost at any node is bounded by the cost of any of its
ancestors. Thus the cost of any solution is bounded by (n + 1)L²x.
If the Product Partition problem instance {B, s') has a solution, then we can
construct an expression tree T for the m ultiplication decision problem instance so
that two non-sibling third-level nodes have equal cost \ / x and the two second-level
nodes at the second level have equal cost L y / x , as shown Figure 2.2(a). We establish
an upper bound on the cost o f T as follows. Since the n + 2 arrays require n 4- 1
m ultiplications, there are n - f l —3 = n —2 m ultiplication nodes below the second level
in the expression tree. Thus, T has a cost of at most K = L'^x -h 2 L y / x + {n — 2) y /x
and hence is a solution to the m ultiplication decision problem instance.
Conversely, if the multiplication decision problem instance has a solution T (that costs no more than K = L²x + 2L√x + (n − 2)√x), we argue that the two second-level nodes of T (i.e., the two children of the root node of T) must have equal costs. The reason is that if a solution to the multiplication decision problem instance has unequal costs on the two second-level nodes, say Ly on one node and Lx/y on the other, where y ≠ √x, then the expression tree would have a cost of at least L²x + Ly + Lx/y. This is greater than K = L²x + 2L√x + (n − 2)√x, since y + x/y > 2√x + 1 and L > (n − 2)√x (see Figure 2.2(b)). Moreover, M_y and M_y′ must be in the two separate subtrees rooted at the two second-level nodes of T. If M_y and M_y′ are not split at the second level, the expression tree would have a cost of at least L²x + L² + x, which is greater than K = L²x + 2L√x + (n − 2)√x (see Figure 2.2(c)). Removing M_y and M_y′ from the leaf nodes in these two subtrees gives us a solution to the Product Partition problem instance.
It follows that the multiplication decision problem is NP-complete. Since every instance of the multiplication sub-problem is an instance of the operation minimization problem, the decision version of the operation minimization problem is NP-complete.
2.5 A Pruning Search Procedure
Since the operation minimization problem has been shown to be NP-complete, it is impractical to seek a polynomial-time algorithm for it. We have to resort either to heuristics or to exponential-time searches for the optima. For the kind of multi-dimensional summations that arise in practice, the number of nested loops and the number of array variables are typically less than ten. Thus, a well-pruned search procedure should be practically feasible. We pursue such an approach here.
The following procedure can be used to exhaustively enumerate all valid formula sequences:

1. Form a pool of the product terms in the given multi-dimensional summation. Let X_a denote the a-th product term and X_a.dimens the set of index variables in X_a[...]. Set r to zero.

2. Increment r. Then, perform either action:

(a) Write a formula f_r[...] = X_a[...] × X_b[...] where X_a[...] and X_b[...] are any two terms in the pool. The indices for f_r are f_r.dimens = X_a.dimens ∪ X_b.dimens. Replace X_a[...] and X_b[...] in the pool by f_r[...].

(b) If there exists a summation index (say i) that appears in exactly one term (say X_a[...]) in the pool, create a formula f_r[...] = Σ_i X_a[...] where f_r.dimens = X_a.dimens − {i}. Replace X_a[...] in the pool by f_r[...].

3. When step 2 cannot be performed any more, a valid formula sequence is obtained. To obtain all valid sequences, exhaust all alternatives in step 2 using depth-first search.
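The procedure above can be sketched in code as follows. This is an illustration, not the author's implementation; the index ranges and all names are assumptions. Each term is represented by its set of index variables, a multiplication costs the product of the ranges of the result's indices, and summing over i costs (N_i − 1) times the size of the result, matching the dense operation counts used in this chapter.

```python
from itertools import combinations

# Example index ranges (the Ni of the loop indices); illustrative values.
N = {'i': 10, 'j': 10, 'k': 10, 'l': 10}

def prod(indices):
    """Product of the ranges of a set of index variables."""
    result = 1
    for i in indices:
        result *= N[i]
    return result

def search(pool, sum_indices, cost, best):
    """Depth-first enumeration of all valid formula sequences.
    pool: tuple of frozensets, one per term (its set of index variables);
    sum_indices: summation indices not yet summed over."""
    if len(pool) == 1 and not sum_indices:
        return min(best, cost)
    # action (b): sum over an index that appears in exactly one term
    for i in sum_indices:
        holders = [p for p, t in enumerate(pool) if i in t]
        if len(holders) == 1:
            p = holders[0]
            t = pool[p]
            step = (N[i] - 1) * prod(t - {i})      # additions for this sum
            new = pool[:p] + pool[p + 1:] + (t - {i},)
            best = search(new, sum_indices - {i}, cost + step, best)
    # action (a): multiply any two terms in the pool
    for a, b in combinations(range(len(pool)), 2):
        f = pool[a] | pool[b]
        new = tuple(t for p, t in enumerate(pool) if p not in (a, b)) + (f,)
        best = search(new, sum_indices, cost + prod(f), best)
    return best

# Illustrative input: sum over j and l of A[i,j] * B[j,k,l] * C[k,l]
pool = (frozenset('ij'), frozenset('jkl'), frozenset('kl'))
print(search(pool, frozenset('jl'), 0, float('inf')))
```

For this small input the cheapest sequence multiplies B and C first, sums over l, multiplies in A, and finally sums over j.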
Figure 2.3: Expression tree transformations for the first pruning rule.
The enumeration procedure above is inefficient in that a particular formula sequence may be generated more than once in the search process. This can be avoided by creating an ordering among the product terms and the intermediate generated functions (which can be treated as new terms, numbered in increasing order as they are generated).

A further reduction in the cost of the search procedure can be achieved by pruning the search space with the following two rules:

1. When a summation index appears in only one term, perform the summation over that index immediately, without considering any other possibilities at that step.

2. If two or more terms have exactly the same set of indices, first multiply them together before considering any other possibilities.
It can be proved that the use of the pruning rules will not change the cost of the optimal formula sequence found. To prove the first pruning rule, suppose Y is the only term in the pool that contains a summation index i and we choose to multiply Y with another term instead of summing it over i. A formula sequence obtained in
this way can be represented by an expression tree T in which i does not appear in any node outside the subtree rooted at Y and the Σ_i node is an ancestor but not the parent of Y. We show that T can be transformed into a similar expression tree T′ in which the Σ_i node is the parent of Y, without increasing the cost. Figure 2.3(a) illustrates this transformation, which is done by moving the Σ_i node towards Y one step at a time until it becomes the parent of Y. If the child of the Σ_i node is another summation node (say Σ_j), we can interchange the positions of the two summation nodes, as shown in Figure 2.3(b), with no effect on the cost, because both Σ_i(Σ_j Z) and Σ_j(Σ_i Z) have the same cost of (N_i × N_j − 1) × Π_{k ∈ Z.dimens − {i,j}} N_k. If the child of the Σ_i node is a multiplication node with two children Z1 and Z2, where Z1 is Y or an ancestor of Y, then we can move the Σ_i node to become the parent of Z1 and a child of the multiplication node (see Figure 2.3(c)). This move affects the costs of the Σ_i node and the multiplication node only. The cost of the summation node cannot increase because Z1.dimens ∪ Z2.dimens ⊇ Z1.dimens. The cost of the multiplication node is reduced by a factor of N_i, since multiplying Z1 and Z2 involves one more index, namely i, than multiplying Σ_i Z1 and Z2.
To prove the second pruning rule, suppose Y1 and Y2 are two terms in the pool that have the same set of indices and we choose to multiply Y1 by another term Z where Z.dimens ≠ Y1.dimens. A formula sequence thus obtained can be represented by an expression tree T in which Y1 and Y2 are the roots of two separate subtrees and Y1 and Z are children of a multiplication node, as shown in Figure 2.4(a). We can transform T into a similar expression tree T′ by moving the subtree rooted at Y1 to become a sibling of Y2, as shown in Figure 2.4(b). This transformation only affects the cost of the nodes along the paths from Y1 and Y2 to their common ancestor. Let
Figure 2.4: The expression tree transformation for the second pruning rule.
those nodes be labeled as shown in the figure, where 0 < a < s. We show that the transformation does not increase the total operation count by comparing the costs of corresponding nodes in T and T′ as follows.

• f′_r cannot cost more than f_r because f_r.indices ⊇ Y1.dimens and f′_r.indices = Y1.dimens.

• f_r.indices ⊇ Z.dimens implies that for all 0 < k ≤ a, f_{r+k}.indices ⊇ f′_{r+k}.indices, and thus f′_{r+k} costs less than or equal to f_{r+k}.

• f_r.indices = Y1.dimens = Y2.dimens implies that for all a + 1 ≤ k ≤ s − 1, f_{r+k}.indices = f′_{r+k}.indices, and hence f′_{r+k} and f_{r+k} have the same cost.

• Since f_{r+a}.indices ⊇ f′_{r+a}.indices and f_{r+s−1}.indices = f′_{r+s−1}.indices, we have f_{r+s}.indices ⊇ f′_{r+s}.indices, which implies f′_{r+s} cannot cost more than f_{r+s}.
Based on the above two rules, together with the ordering of the product terms that guarantees each formula sequence is generated only once, we obtain the following pruning search procedure:
1. Form a list of the product terms in the given multi-dimensional summation. Let X_a denote the a-th product term and X_a.dimens the set of index variables in X_a[...]. Set r and c to zero. Set d to the number of product terms.

2. While there exists a summation index (say i) that appears in exactly one term (say X_a[...]) in the list and a > c, increment r and d and create a formula f_r[...] = Σ_i X_a[...] where f_r.dimens = X_a.dimens − {i}. Remove X_a[...] from the list. Append to the list X_d[...] = f_r[...]. Set c to a.

3. Increment r and d and form a formula f_r[...] = X_a[...] × X_b[...] where X_a[...] and X_b[...] are two terms in the list such that a < b and b > c, giving priority to terms that have exactly the same set of indices. The indices for f_r are f_r.dimens = X_a.dimens ∪ X_b.dimens. Remove X_a[...] and X_b[...] from the list. Append to the list X_d[...] = f_r[...]. Set c to b. Go to step 2.

4. When steps 2 and 3 cannot be performed any more, a valid formula sequence is obtained. To obtain all valid sequences, exhaust all alternatives in step 3 using depth-first search. The search space can be further pruned using the branch-and-bound technique by giving up a search path as soon as its partial cost exceeds the lowest cost among all complete sequences found so far.
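The pruned search can be sketched as below. Again this is an illustration, not the thesis code: rule 1 is applied greedily without backtracking, rule 2 restricts the candidate multiplication pairs, and a branch-and-bound cutoff abandons any path whose partial cost already reaches the best complete cost found so far.

```python
import itertools

N = {'i': 10, 'j': 10, 'k': 10, 'l': 10}   # example index ranges

def prod(indices):
    out = 1
    for i in indices:
        out *= N[i]
    return out

def best_cost(pool, sums, cost=0, best=float('inf')):
    if cost >= best:                       # branch-and-bound cutoff
        return best
    # rule 1: an index held by exactly one term is summed immediately,
    # without considering any other possibility at this step
    for i in sorted(sums):
        holders = [p for p, t in enumerate(pool) if i in t]
        if len(holders) == 1:
            p = holders[0]
            step = (N[i] - 1) * prod(pool[p] - {i})
            new = pool[:p] + pool[p + 1:] + [pool[p] - {i}]
            return best_cost(new, sums - {i}, cost + step, best)
    if len(pool) == 1:
        return min(best, cost)
    # rule 2: terms with identical index sets are multiplied first
    pairs = list(itertools.combinations(range(len(pool)), 2))
    same = [(a, b) for a, b in pairs if pool[a] == pool[b]]
    for a, b in (same if same else pairs):
        f = pool[a] | pool[b]
        new = [t for p, t in enumerate(pool) if p not in (a, b)] + [f]
        best = best_cost(new, sums, cost + prod(f), best)
    return best

pool = [frozenset('ij'), frozenset('jkl'), frozenset('kl')]
print(best_cost(pool, frozenset('jl')))
```

On the same small input as before, the pruned search returns the same minimum cost as the exhaustive enumeration while exploring far fewer sequences.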
2.6 Common Sub-Expressions
In some computational physics applications, the same input array may appear in more than one product term, and each occurrence of a repeated input array may be associated with different index variables. This leads to common sub-expressions in
the multi-dimensional summation. Formulae involving repeated input arrays could actually be computing the same intermediate result. Consider the following function:

R[i,k,m] = Σ_j Σ_l (A[i,j] × B[j,k] × A[k,l] × B[l,m])

Assume N_i = N_j = N_k = N_l = N_m. One formula sequence that computes R[i,k,m] is:

f1[i,j,k] = A[i,j] × B[j,k]
f2[i,k] = Σ_j f1[i,j,k]
f3[k,l,m] = A[k,l] × B[l,m]
f4[k,m] = Σ_l f3[k,l,m]
R[i,k,m] = f2[i,k] × f4[k,m]

Here, f1 and f3 are computing the same intermediate result since f3[k,l,m] = f1[k,l,m]. We call f1 and f3 equivalent formulae or common sub-expressions, and their costs should be counted only once. In other words, f3 can be obtained without any additional cost. Notice that f2 and f4 are also equivalent to each other, as f4[k,m] = f2[k,m]. Thus, f4 has a zero cost.

To detect common sub-expressions, we compare each newly-formed formula against each existing formula and see if the two formulae are of the same type (i.e., both multiplications or both summations), the operand arrays are equivalent, and the index variables in the operand arrays (and the summation index for summation formulae) can be mapped one-to-one from one formula to the other. In the above example, f1 and f3 are equivalent because they are both products of arrays A and B and the indices k, l, and m in f3 can be mapped one-to-one to the indices i, j, and
k in f1, respectively. Also, f4 is equivalent to f2 because they are both summations, their operands f3 and f1 are equivalent, and the indices k and m can be mapped one-to-one to i and k.
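The equivalence check described above can be sketched as follows. This is a simplified illustration, not the thesis implementation: the formula representation and all names are assumptions, and the recursive equivalence of operands is abstracted into equal operand names.

```python
def _try_map(pairs):
    """Try to build a bijection between index variables that maps each
    f-index tuple onto the corresponding g-index tuple."""
    fwd, bwd = {}, {}
    for fi, gi in pairs:
        if len(fi) != len(gi):
            return False
        for a, b in zip(fi, gi):
            if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
                return False
    return True

def equivalent(f, g):
    """f, g are ('mul', [(name, idxs), (name, idxs)]) or
    ('sum', index, (name, idxs)).  Equal names stand for operand arrays
    already known to be equivalent (e.g. the same input array)."""
    if f[0] != g[0]:
        return False
    if f[0] == 'sum':
        _, i, (fn, fi) = f
        _, j, (gn, gi) = g
        return fn == gn and _try_map([((i,), (j,)), (fi, gi)])
    _, [(n1, i1), (n2, i2)] = f
    _, [(m1, j1), (m2, j2)] = g
    # a product's operands may match in either order
    return ((n1 == m1 and n2 == m2 and _try_map([(i1, j1), (i2, j2)])) or
            (n1 == m2 and n2 == m1 and _try_map([(i1, j2), (i2, j1)])))

# f1[i,j,k] = A[i,j] x B[j,k]  vs  f3[k,l,m] = A[k,l] x B[l,m]
f1 = ('mul', [('A', ('i', 'j')), ('B', ('j', 'k'))])
f3 = ('mul', [('A', ('k', 'l')), ('B', ('l', 'm'))])
print(equivalent(f1, f3))  # True: i, j, k map one-to-one to k, l, m
```

The bijection found here (i to k, j to l, k to m) is the inverse of the mapping described in the text, which maps f3's indices onto f1's.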
2.7 Sparse Arrays
When some input arrays are sparse, the intermediate and final results could be sparse, and the number of operations required for each formula is lower than if the arrays were dense. For estimating the operation count, there is a need to determine the sparsity of the intermediate and final results and the reduced cost of each formula.

We make two assumptions about the sparsity of the input arrays. First, we assume that sparsity only exists between two of the array dimensions, i.e., whether an element is zero depends only on the indices of exactly two dimensions. Second, we assume sparsity is uniform, i.e., the non-zero elements are uniformly distributed over the range of values of an array dimension if it involves sparsity.

However, the above assumptions are not enough to determine the sparsity of the result arrays; it depends further on the structure of the sparse arrays. In the computational physics applications we consider, some array dimensions correspond to points in three-dimensional space, and sparsity arises from the fact that certain physical quantities diminish with distance and are treated as zero beyond a cut-off limit. For such sparse arrays we develop an analytical model for the sparsity of the result arrays in terms of the sparsity of the operand arrays of a summation or multiplication formula, based on the following observation. Finding the resulting sparsity is equivalent to finding the probability that a set of randomly distributed points in 3D space satisfies some given constraints on their pairwise distances.
We represent the sparsity of each array as a list of sparsity entries. Under the above assumptions, an input array can have only a single sparsity entry, but intermediate arrays may have multiple sparsity entries. Each sparsity entry contains the two dimensions of the array involved in the sparsity and a sparsity factor. The sparsity factor is defined as the fraction of non-zero elements between the two dimensions (which is always between 0 and 1) and is proportional to the cube of the cut-off limit between the two points in 3D space corresponding to the two dimensions. The overall sparsity of an intermediate array (i.e., the fraction of non-zero elements in the entire array) is the product of all sparsity factors in its sparsity entries. The sparsity of an array can also be viewed as a graph in which each array dimension involving sparsity is a vertex and each sparsity entry is an edge that connects the two vertices for the two array dimensions in that entry. We call this graph a sparsity graph. For convenience, when an array reference is given, we use index variables to refer to array dimensions and to label the vertices in a sparsity graph. As an example, Figure 2.5(a) shows the sparsity entry and the sparsity graph of an array referenced by A[i,j,k] in which sparsity exists between the last two dimensions and 2% of its elements are non-zero.
For a multiplication formula, we combine the sparsity entries of the two operand arrays to obtain the sparsity entries of the result array and examine the resulting sparsity graph. If no cycle is formed in the graph, no further work is needed, and the overall sparsity of the result array is the product of those of the operand arrays, since the dimensions involving sparsity represent independent points in 3D space. A cycle of size 2 is formed if both operand arrays have sparsity between dimensions referenced by the same pair of index variables. The two sparsity entries forming the
Figure 2.5: Sparsity entries and sparsity graphs of arrays.
cycle can be coalesced into one by keeping the smaller of the two sparsity factors. This is because, for both operand elements to be non-zero, the distance between the pair of points that the two index variables represent in 3D space must be less than the smaller of the two cut-off limits corresponding to the two sparsity factors. Figures 2.5(c) and (d) show the resulting sparsity entries and sparsity graphs for the two cases above. Determining the overall sparsity in the presence of cycles of size 3 or larger requires solving multi-dimensional summations and is not considered in this thesis.
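The multiplication rule can be sketched as follows. This is an illustrative reading of the text, not the thesis implementation; a union-find pass over the combined sparsity graph flags the unhandled cycles of size 3 or more.

```python
def multiply_sparsity(entries_a, entries_b):
    """Combine the sparsity entries (dim1, dim2, factor) of the two operands
    of a multiplication.  A size-2 cycle (both operands sparse over the same
    pair of dimensions) is coalesced by keeping the smaller factor; cycles
    of size 3 or more are left unhandled, as in the text."""
    combined = {}
    for d1, d2, s in list(entries_a) + list(entries_b):
        key = tuple(sorted((d1, d2)))
        combined[key] = min(s, combined.get(key, 1.0))
    # after coalescing, any remaining cycle in the sparsity graph has size >= 3
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for d1, d2 in combined:
        r1, r2 = find(d1), find(d2)
        if r1 == r2:
            raise NotImplementedError("sparsity-graph cycle of size >= 3")
        parent[r1] = r2
    entries = [(d1, d2, s) for (d1, d2), s in combined.items()]
    overall = 1.0
    for _, _, s in entries:
        overall *= s
    return entries, overall

# No cycle: an entry (j,k,0.02) combined with (j,l,0.05) gives an overall
# sparsity of 0.02 * 0.05 = 0.001, as in the acyclic case of Figure 2.5
print(multiply_sparsity([('j', 'k', 0.02)], [('j', 'l', 0.05)])[1])
```

A 2-cycle input such as (j,k,0.02) against (j,k,0.05) collapses to the single entry (j,k,0.02), keeping the smaller factor.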
For a summation formula, the overall sparsity of the result array is the probability that, for a set of randomly distributed points in 3D space, there exists a point (corresponding to the summation index) such that some given distance constraints are satisfied. We first copy the sparsity entries from the operand array to the result array, paying special attention to the entries involving the summation index. If only
one sparsity entry has the summation index, we remove it from the result array. If the summation index (say i) appears in exactly two sparsity entries, say (i, j, s1) and (i, k, s2), we replace the two entries with a single entry (j, k, (s1^(1/3) + s2^(1/3))³), because the cut-off limit of the distance between the two points represented by j and k equals the sum of the two given cut-off limits, and the sparsity factor is proportional to the cube of the cut-off limit. Figure 2.5(e) shows an example of the second case. When the summation index appears in more than two sparsity entries, we need to solve some multi-dimensional summations.
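The summation rule can be sketched likewise. This is illustrative code; the merged factor follows from factors being proportional to the cube of the cut-off limits, and with the factors 0.02 and 0.05 it reproduces the 0.262 overall sparsity shown in Figure 2.5(e).

```python
def sum_sparsity(entries, i):
    """Sparsity entries of the result of summing over index i.  Entries are
    (dim1, dim2, factor); a factor is proportional to the cube of a cut-off
    radius, so two entries sharing i merge with their cut-offs added."""
    with_i = [e for e in entries if i in e[:2]]
    rest = [e for e in entries if i not in e[:2]]
    if len(with_i) <= 1:
        return rest                        # a lone entry involving i is dropped
    if len(with_i) == 2:
        (a1, b1, s1), (a2, b2, s2) = with_i
        j = a1 if b1 == i else b1          # the partner dimension of each entry
        k = a2 if b2 == i else b2
        s = (s1 ** (1 / 3) + s2 ** (1 / 3)) ** 3
        return rest + [(j, k, min(s, 1.0))]   # a sparsity factor is at most 1
    raise NotImplementedError("summation index in more than two entries")

# (i,j,0.02) and (i,k,0.05) merge into a single entry (j,k,~0.262)
print(sum_sparsity([('i', 'j', 0.02), ('i', 'k', 0.05)], 'i'))
```

Entries not involving the summation index pass through unchanged, matching the copy-then-adjust description in the text.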
Once the sparsity of a result array is known, finding the cost of the formula that computes the result array is straightforward. In a multiplication formula, the number of operations is the same as the number of non-zero elements in the result array. In a summation formula, the number of operations equals the number of non-zero elements in the operand array minus the number of non-zero elements in the result array (since adding N numbers requires only N − 1 additions). These operation counts are exact if mechanisms exist to ensure operations are performed only on non-zero operands; otherwise, the operation counts represent lower bounds on the actual number of operations.
2.8 Fast Fourier Transform
In many of the multi-dimensional summations, some of the product terms are exponential functions of some of the indices. Since the exponential function is unique, we consider all product terms that are exponential functions identical. We also assume the products of exponential functions can be obtained at zero cost, since such products are themselves exponential functions.
The existence of exponential functions in the product terms permits some formulae to be computed as fast Fourier transforms (FFTs), which may involve significantly fewer operations than if the same result were computed as a multiplication followed by a summation. The cost of an FFT formula equals the number of individual FFTs to be performed times the cost of each individual FFT. The general form of an FFT formula is

f_r[K, j] = Σ_i X[K, i] × exp[i, j]

where K is a set of indices, j ∉ K, and exp[i, j] denotes the exponential term. If X is dense, the number of operations in computing f_r is given by

(Π_{k ∈ K} N_k) × C × N log₂ N

where N = max(N_i, N_j), C is a constant that depends on the FFT algorithm in use, and C × N log₂ N is the cost of each individual FFT.
Sparsity in the operand array could lower both components of the FFT cost. Sparsity factors involving the summation index i reduce the size of the individual FFTs, and those not involving i reduce the number of FFTs. Whether X is sparse or not, the number of operations can be expressed as

(size(f_r)/N_j) × C × N log₂ N

where N = max(N_j × size(X)/size(f_r), N_j), size(A) denotes the number of non-zero elements in array A, and N_j × size(X)/size(f_r) is the number of non-zero elements from X that participate in each individual FFT. For example, consider the formula f1[j,k,l] = Σ_i X[i,k,l] × exp[i,j] in which X has sparsity entries (i,k,0.16) and (k,l,0.2). The resulting sparsity of f1 is given by (k,l,0.2). If N_i = 800, N_j = N_k = N_l = 100, and C = 10, then size(X) = 800 × 100 × 100 × 0.16 × 0.2 = 2.56 × 10⁵, size(f1) = 100³ × 0.2 = 2 × 10⁵, N = max(100 × 2.56 × 10⁵ / (2 × 10⁵), 100) = 128, and the number of operations is (2 × 10⁵/100) × 10 × 128 log₂ 128 = 1.792 × 10⁷.
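The sparse FFT cost expression can be checked directly against this worked example with a small helper (an illustration, not the thesis code):

```python
from math import log2

def fft_formula_cost(Nj, C, size_X, size_fr):
    """Operation count of f_r[K,j] = FFT_i of X[K,i] * exp[i,j], following
    the expression in the text: (number of FFTs) x (cost per FFT)."""
    n_ffts = size_fr / Nj                      # individual FFTs to perform
    n = max(Nj * size_X / size_fr, Nj)         # points per individual FFT
    return n_ffts * C * n * log2(n)

# the worked example: X has sparsity entries (i,k,0.16) and (k,l,0.2),
# N_i = 800, N_j = N_k = N_l = 100, and C = 10
size_X = 800 * 100 * 100 * 0.16 * 0.2          # 2.56e5 non-zero elements
size_f1 = 100 ** 3 * 0.2                       # 2e5 non-zero elements
print(fft_formula_cost(100, 10, size_X, size_f1))  # about 1.792e7
```

With a dense X, size(f_r)/N_j reduces to the product of the N_k for k in K, so the helper degenerates to the dense cost expression given earlier.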
Since the result array and its sparsity are the same whether it is computed using an FFT or not, this choice does not interact with subsequent formulae. Thus, we can compare the FFT cost with the combined cost of multiplication and summation and pick the lower one.
2.9 An Example
The algorithm described above for finding a formula sequence with the minimum number of operations has been implemented and tested. We have applied the program to several computational physics applications that involve very complex integrals. The optimal solutions generated by the program are usually a factor of two better than the best manually-optimized solutions. One example application, which determines the self-energy in the electronic structure of solids, is specified by the following input file to our program:

sum r,r1,RL,RL1,RL2,RL3
Y[r,RL] Y[r,RL2] Y[r1,RL3] Y[r1,RL1] G[RL1,RL,t] G[RL2,RL3,t] exp[k,r] exp[G,r] exp[k,r1] exp[G1,r1]
RL,RL1,RL2,RL3 1000
t 100
k 10
G,G1 1000
r,r1 100000
sparse
Y 1,2,0.1
end
The first 2 lines show the summation to be computed. The next 5 lines are the ranges of the index variables. The last 3 lines specify the sparsity in the input arrays. Note that the first 4 product terms involve an identical input array (as do the next
2 product terms) and the input array Y is sparse. The constant C in FFT formulae is set to 10. The best hand-obtained formula sequence has a cost of 3.54 × 10¹⁵ operations. Our program enumerated 369 formula sequences and found 42 sequences with the same minimum cost of 1.89 × 10¹⁵ operations. One minimum-cost sequence (which involves 2 FFT formulae) generated by the program is shown below:
f1[r,RL,RL1,t] = Y[r,RL] * G[RL1,RL,t]             cost= 1e+12        <r,RL,0.1>
f2[r,RL1,t] = sum RL f1[r,RL,RL1,t]                cost= 9.9e+11      dense
f5[r,RL2,r1,t] = Y[r,RL2] * f2[r1,RL2,t]           cost= 1e+14        <r,RL2,0.1>
f6[r,r1,t] = sum RL2 f5[r,RL2,r1,t]                cost= 9.9e+13      dense
f7[k,r,r1] = exp[k,r] * exp[k,r1]                  cost= 0            dense
f10[r,r1,t] = f6[r,r1,t] * f6[r1,r,t]              cost= 1e+12        dense
f11[k,r,r1,t] = f7[k,r,r1] * f10[r,r1,t]           cost= 1e+13        dense
f13[k,r1,t,G] = fft r f11[k,r,r1,t] * exp[G,r]     cost= 1.660964e+15 dense
f15[k,t,G,G1] = fft r1 f13[k,r1,t,G] * exp[G1,r1]  cost= 1.660964e+13 dense
CHAPTER 3

MEMORY USAGE MINIMIZATION
Given an optimal sequence of multiplication and summation formulae, the simplest way to implement it is to compute the formulae one by one, each coded as a set of perfectly nested loops, and to store the intermediate results produced by each formula in an intermediate array. However, in practice, the input and intermediate arrays could be so large that they cannot fit into the available memory. Hence, there is a need to fuse the loops as a means of reducing memory usage. By fusing loops between the producer loop and the consumer loop of an intermediate array, intermediate results are formed and used in a pipelined fashion and reuse the same reduced array space. There are many different ways to fuse the loops, which could result in different memory usage. The problem of finding a loop fusion configuration that minimizes memory usage without increasing the operation count is non-trivial.
In this chapter, we develop an optimization framework that appropriately models the relation between loop fusion and memory usage. We present algorithms that find the optimal loop fusion configuration minimizing memory usage, under both static and dynamic memory allocation models.
Section 3.1 introduces the memory usage minimization problem. Section 3.2 describes some preliminary concepts that we use to formulate our solutions to the problem. Section 3.3 presents an algorithm for finding a memory-optimal loop fusion configuration under static memory allocation. Section 3.4 investigates the sub-problem of determining a memory-optimal evaluation of expression trees involving large objects with no or equal fusions. Based on the solution to this sub-problem, we develop in Section 3.5 an algorithm to solve the memory usage minimization problem under the dynamic memory allocation model. Sections 3.6, 3.7, and 3.8 discuss how common sub-expressions, sparse arrays, and fast Fourier transforms, respectively, affect the memory usage minimization problem. Section 3.9 explores ways to further reduce memory usage at the cost of increasing the number of arithmetic operations. An example of the application of the memory usage minimization algorithm to a computational physics problem is given in Section 3.10.
3.1 Introduction
Consider the multi-dimensional summation shown in Figure 3.1(a), computed by the formula sequence in Figure 3.1(b), which is represented by the expression tree in Figure 3.1(c). A naive way to implement the computation is to have a set of perfectly-nested loops for each node in the tree, as shown in Figure 3.2(a), where the nesting indicates the scopes of the loops. Figure 3.2(b) shows how the sizes of the arrays may be reduced by the use of loop fusions. It shows the resulting loop structure after fusing all the loops between A and f1, all the loops among B, C, f2, and f3, and all the loops between f4 and f5. Here, A, B, C, f2, and f4 are reduced to scalars. After fusing all the loops between a node and its parent, all dimensions of the child array
f5[k] = Σ_i Σ_j Σ_l (A[i,j] × B[j,k,l] × C[k,l])

(a) A multi-dimensional summation

f1[j] = Σ_i A[i,j]
f2[j,k,l] = B[j,k,l] × C[k,l]
f3[j,k] = Σ_l f2[j,k,l]
f4[j,k] = f1[j] × f3[j,k]
f5[k] = Σ_j f4[j,k]

(b) A formula sequence computing (a)    (c) An expression tree for (b)

Figure 3.1: An example multi-dimensional summation and two representations of a computation.
are no longer needed and can be eliminated. The elements in the reduced arrays are now reused to hold different values at different iterations of the fused loops. Each of those values was held by a different array element before the loops were fused (as in Figure 3.2(a)). Note that some loop nests (such as those for B and C) are reordered and some loops within loop nests (such as the j-, k-, and l-loops for B, f2, and f3) are permuted in order to facilitate loop fusions.

For now, we assume the leaf node arrays (i.e., input arrays) can be generated one element at a time (by the genv function for array v) so that loop fusions with their parents are allowed. This assumption holds for arrays in which the value of each element is a function of the array subscripts, as is the case for many arrays in the physics computations that we work on. As will become clear later, the case where an input array has to be read in or produced in slices or in its entirety can be handled by disabling the fusion of some or all of the loops between the leaf node and its parent.
(a) one perfectly-nested loop nest per node:

for i
  for j
    A[i,j] = genA(i,j)
for j
  for k
    for l
      B[j,k,l] = genB(j,k,l)
for k
  for l
    C[k,l] = genC(k,l)
initialize f1
for i
  for j
    f1[j] += A[i,j]
for j
  for k
    for l
      f2[j,k,l] = B[j,k,l] × C[k,l]
initialize f3
for j
  for k
    for l
      f3[j,k] += f2[j,k,l]
for j
  for k
    f4[j,k] = f1[j] × f3[j,k]
initialize f5
for j
  for k
    f5[k] += f4[j,k]

(b) fusing A with f1; B, C, f2, and f3; and f4 with f5:

initialize f1
for i
  for j
    A = genA(i,j)
    f1[j] += A
initialize f3
for k
  for l
    C = genC(k,l)
    for j
      B = genB(j,k,l)
      f2 = B × C
      f3[j,k] += f2
initialize f5
for j
  for k
    f4 = f1[j] × f3[j,k]
    f5[k] += f4

(c) fusing all the j-loops and the k- and l-loops inside them:

for k
  for l
    C[k,l] = genC(k,l)
initialize f5
for j
  for i
    A[i] = genA(i,j)
  initialize f1
  for i
    f1 += A[i]
  for k
    initialize f3
    for l
      B = genB(j,k,l)
      f2 = B × C[k,l]
      f3 += f2
    f4 = f1 × f3
    f5[k] += f4

(d)-(f) show the corresponding fusion graphs (solid lines: fusion edges; dotted lines: potential fusion edges).
Figure 3.2: Three loop fusion configurations for the expression tree in Figure 3.1.
Figure 3.2(c) shows another possible loop fusion configuration, obtained by fusing all the j-loops and then all the k-loops and l-loops inside them. The sizes of all arrays except C and f5 are smaller. By fusing the j-, k-, and l-loops between those nodes, the j-, k-, and l-dimensions of the corresponding arrays can be eliminated. Hence, B, f1, f2, f3, and f4 are reduced to scalars, while A becomes a one-dimensional array.

In general, fusing a t-loop between a node v and its parent eliminates the t-dimension of the array v and reduces the array size by a factor of N_t. In other words, the size of an array after loop fusions equals the product of the ranges of the loops that are not fused with its parent. We only consider fusions of loops among nodes that are all transitively related by (i.e., form a transitive closure over) parent-child relations. Fusing loops between unrelated nodes (such as fusing siblings without fusing their parent) has no effect on array sizes. We also restrict our attention for now to loop fusion configurations that do not increase the operation count. The tradeoff between memory usage and arithmetic operations is considered in Section 3.9.
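This size rule is a one-line computation, sketched here with assumed index ranges for illustration:

```python
def fused_array_size(N, dims, fused_with_parent):
    """Size of array v after loop fusion: the product of the ranges of the
    dimensions whose loops are NOT fused with v's parent."""
    size = 1
    for d in dims:
        if d not in fused_with_parent:
            size *= N[d]
    return size

N = {'j': 10, 'k': 20, 'l': 30}              # example index ranges
print(fused_array_size(N, ('j', 'k', 'l'), set()))            # 6000
print(fused_array_size(N, ('j', 'k', 'l'), {'k', 'l'}))       # 10
print(fused_array_size(N, ('j', 'k', 'l'), {'j', 'k', 'l'}))  # 1 (scalar)
```

Fusing all loops with the parent reduces the array to a scalar, as happens to B, f1, f2, f3, and f4 in Figure 3.2(c).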
In the class of loops considered in this thesis, the only dependence relations are those between children and parents, and array subscripts are simply loop index variables¹. So, loop permutations, loop nest reordering, and loop fusions are always legal as long as child nodes are evaluated before their parents. This freedom allows the loops to be permuted, reordered, and fused in a large number of ways that differ in memory usage. Finding a loop fusion configuration that uses the least memory is not trivial. We believe this problem is NP-complete but have not found a proof yet.

¹When array subscripts are not simple loop variables, as many researchers have studied, more dependence relations exist, which prevent some loop rearrangements and/or loop fusions. In that case, a restricted set of loop fusion configurations would need to be searched in minimizing memory usage.
Fusion graphs. Let T be an expression tree. For any given node v ∈ T, let subtree(v) be the set of nodes in the subtree rooted at v, v.parent be the parent of v, and v.indices be the set of loop indices for v (including the summation index v.sumindex if v is a summation node). A loop fusion configuration can be represented by a graph called a fusion graph, which is constructed from T as follows.

1. Replace each node v in T by a set of vertices, one for each index i ∈ v.indices.

2. Remove all tree edges in T for clarity.

3. For each loop fused (say, of index i) between a node and its parent, connect the i-vertices in the two nodes with a fusion edge.

4. For each pair of vertices with matching index between a node and its parent, if they are not already connected with a fusion edge, connect them with a potential fusion edge.
Figure 3.2 shows the fusion graphs alongside the loop fusion configurations. In the figure, solid lines are fusion edges and dotted lines are potential fusion edges, which are fusion opportunities not exploited. As an example, consider the loop fusion configuration in Figure 3.2(b) and the corresponding fusion graph in Figure 3.2(e). Since the j-, k-, and l-loops are fused between f2 and f3, there are three fusion edges, one for each of the three loops, between f2 and f3 in the fusion graph. Also, since no loops are fused between f3 and f4, the edges between f3 and f4 in the fusion graph remain potential fusion edges.

In a fusion graph, we call each connected component of fusion edges (i.e., a maximal set of connected fusion edges) a fusion chain, which corresponds to a fused loop in the
loop structure. The scope of a fusion chain c, denoted scope(c), is defined as the set of nodes it spans. In Figure 3.2(f), there are three fusion chains, one for each of the j-, k-, and l-loops; the scope of the shortest fusion chain is {B, f2, f3}. The scopes of any two fusion chains in a fusion graph must either be disjoint or a subset/superset of each other. Scopes of fusion chains do not partially overlap because loops do not (i.e., loops must be either separate or nested). Therefore, any fusion graph with fusion chains whose scopes partially overlap is illegal and does not correspond to any loop fusion configuration.
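The legality condition is a pairwise check over chain scopes, sketched below. This is an illustration; the chain node sets are assumptions read off configuration (c) of Figure 3.2.

```python
def legal_fusion_graph(chain_scopes):
    """A fusion graph is legal iff the scopes of every two fusion chains
    are disjoint or one is a subset of the other (loops are either
    separate or nested, never partially overlapping)."""
    for a in chain_scopes:
        for b in chain_scopes:
            if a is not b and (a & b) and not (a <= b or b <= a):
                return False
    return True

# chains as in Figure 3.2(f) (node sets assumed from configuration (c)):
j_chain = {'A', 'f1', 'B', 'f2', 'f3', 'f4', 'f5'}
k_chain = {'B', 'f2', 'f3', 'f4', 'f5'}
l_chain = {'B', 'f2', 'f3'}
print(legal_fusion_graph([j_chain, k_chain, l_chain]))   # True: nested
print(legal_fusion_graph([{'f1', 'f2'}, {'f2', 'f3'}]))  # False: partial overlap
```

The second call shows the illegal case: two chains sharing f2 while neither contains the other.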
Fusion graphs help us visualize the structure of the fused loops and find further fusion opportunities. If we can find a set of potential fusion edges that, when converted to fusion edges, does not lead to partially overlapping scopes of fusion chains, then we can perform the corresponding loop fusions and reduce the size of some arrays. For example, the i-loops between A and f1 in Figure 3.2(f) can be further fused, and array A would be reduced to a scalar. If converting all potential fusion edges in a fusion graph to fusion edges does not make the fusion graph illegal, then we can completely fuse all the loops and achieve optimal memory usage. But for many fusion graphs of real-life loop configurations (including the ones in Figure 3.2), this does not hold. Instead, potential fusion edges may be mutually prohibitive; fusing one loop could prevent the fusion of another. In Figure 3.2(e), fusing the j-loops between f1 and f4 would disallow the fusion of the k-loops between f3 and f4.
Although a fusion graph specifies which loops are fused, it does not fully determine the permutations of the loops and the ordering of the loop nests. As we will see in Section 3.5, under dynamic memory allocation, reordering loop nests could alter memory usage without changing array sizes.
3.2 Preliminaries
So far, we have been describing the fusion between a node and its parent by the set of fused loops (or their loop indices, such as {i, j}). But in order to compare loop fusion configurations for a subtree, it is desirable to include information about the relative scopes of the fused loops in the subtree.

Scope and fusion scope of a loop. The scope of a loop of index i in a subtree rooted at v, denoted scope(i, v), is defined in the usual sense as the set of nodes in the subtree that the fused loop spans. That is, if the i-loop is fused, scope(i, v) = scope(c) ∩ subtree(v), where c is a fusion chain for the i-loop with v ∈ scope(c). If the i-loop of v is not fused, then scope(i, v) = ∅. We also define the fusion scope of an i-loop in a subtree rooted at v as fscope(i, v) = scope(i, v) if the i-loop is fused between v and its parent; otherwise fscope(i, v) = ∅. As an example, for the fusion graph in Figure 3.2(e), scope(j, f3) = {B, f2, f3}, but fscope(j, f3) = ∅.
Indexset sequence. To describe the relative scopes of a set of fused loops,
we introduce the notion of an indexset sequence, which is defined as an ordered list
of disjoint, non-empty sets of loop indices. For example, f = ({i, k}, {j}) is an
indexset sequence. For simplicity, we write each indexset in an indexset sequence
as a string. Thus, f is written as (ik, j). Let g and g' be indexset sequences. We
denote by |g| the number of indexsets in g, by g[r] the r-th indexset in g, and by
Set(g) the union of all indexsets in g, i.e., Set(g) = ∪_{1≤r≤|g|} g[r]. For instance,
|f| = 2, f[1] = {i, k}, and Set(f) = Set((j, i, k)) = {i, j, k}. We say that g' is a
prefix of g if |g'| ≤ |g|, g'[|g'|] ⊆ g[|g'|], and for all 1 ≤ r < |g'|, g'[r] = g[r]. We
write this relation as prefix(g', g). So, (), (i), (k), (ik), and (ik, j) are prefixes of f,
but (i, j) is not. The concatenation of g and an indexset x, denoted g + x, is defined
as the indexset sequence g'' such that if x ≠ ∅, then |g''| = |g| + 1, g''[|g''|] = x,
and for all 1 ≤ r ≤ |g|, g''[r] = g[r]; otherwise, g'' = g.
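These operations can be rendered in a few lines of Python. This is our own illustration, not part of the dissertation; an indexset sequence is represented as a tuple of frozensets, and the helper names (seq_set, is_prefix, concat) are invented here.

```python
# Our own Python sketch of the indexset-sequence operations defined above.

def seq_set(g):
    """Set(g): the union of all indexsets in g."""
    out = set()
    for x in g:
        out |= x
    return out

def is_prefix(gp, g):
    """prefix(g', g): |g'| <= |g|, the last indexset of g' is a subset of
    the corresponding indexset of g, and all earlier indexsets are equal."""
    if len(gp) > len(g):
        return False
    if not gp:
        return True
    if not gp[-1] <= g[len(gp) - 1]:
        return False
    return all(gp[r] == g[r] for r in range(len(gp) - 1))

def concat(g, x):
    """g + x: append indexset x to g unless x is empty."""
    return g + (frozenset(x),) if x else g

f = (frozenset('ik'), frozenset('j'))                       # f = (ik, j)
assert seq_set(f) == {'i', 'j', 'k'}
assert is_prefix((frozenset('ik'),), f)                     # (ik) is a prefix of f
assert not is_prefix((frozenset('i'), frozenset('j')), f)   # (i, j) is not
```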
Fusion. We use the notion of an indexset sequence to define a fusion. Intuitively,
the loops fused between a node and its parent are ranked by their fusion scopes in
the subtree from largest to smallest; two loops with the same fusion scope have the
same rank (i.e., are in the same indexset). For example, in Figure 3.2(f), the fusion
between f2 and f3 is (jkl) and the fusion between f4 and f5 is (j, k) (because the
fused j-loop covers two more nodes, A and f1). Formally, a fusion between a node v
and v.parent is an indexset sequence f such that

1. Set(f) ⊆ v.indices ∩ v.parent.indices,

2. for all i ∈ Set(f), the i-loop is fused between v and v.parent, and

3. for all i ∈ f[r] and i' ∈ f[r'],

(a) r = r' iff fscope(i, v) = fscope(i', v), and

(b) r < r' iff fscope(i, v) ⊃ fscope(i', v).
Nesting. Similarly, a nesting of the loops at a node v can be defined as an
indexset sequence. Intuitively, the loops at a node are ranked by their scopes in the
subtree; two loops have the same rank (i.e., are in the same indexset) if they have
the same scope. For example, in Figure 3.2(e), the loop nesting at f2 is (kl, j), at f4
is (jk), and at B is (jkl). Formally, a nesting of the loops at a node v is an indexset
sequence h such that

1. Set(h) = v.indices and

2. for all i ∈ h[r] and i' ∈ h[r'],

(a) r = r' iff scope(i, v) = scope(i', v), and

(b) r < r' iff scope(i, v) ⊃ scope(i', v).

By definition, the loop nesting at a leaf node v must be (v.indices) because all loops
at v have empty scope.
Legal fusion. A legal fusion graph (corresponding to a loop fusion configuration)
for an expression tree T can be built up in a bottom-up manner by extending and
merging legal fusion graphs for the subtrees of T. For a given node v, the nesting h
at v summarizes the fusion graph for the subtree rooted at v and determines what
fusions are allowed between v and its parent. A fusion f is legal for a nesting h at v
if prefix(f, h) and Set(f) ⊆ v.parent.indices. This is because, to keep the fusion graph
legal, loops with larger scopes must be fused before fusing those with smaller scopes,
and only loops common to both v and its parent may be fused. For example, consider
the fusion graph for the subtree rooted at f3 in Figure 3.2(e). Since the nesting at f2
is (kl, j) and f3.indices = {j, k, l}, the legal fusions between f2 and f3 are (), (k), (l),
(kl), and (kl, j). Notice that all legal fusions for a node v are prefixes of a maximal
legal fusion, which can be expressed as

MaxFusion(h, v) = max{f | prefix(f, h) and Set(f) ⊆ v.parent.indices}

where h is the nesting at v. In Figure 3.2(e), the maximal legal fusion for C is (kl),
and for f2 is (kl, j).
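Because all legal fusions are prefixes of one another, MaxFusion can be computed greedily. The sketch below is our own (not the dissertation's), writing indexsets as strings, e.g. ('kl', 'j') for (kl, j):

```python
# Our own sketch of MaxFusion: the longest prefix of the nesting h whose
# indices all lie in the parent's index set.

def max_fusion(h, parent_indices):
    m = []
    for x in h:
        inside = set(x) & set(parent_indices)
        if inside:
            m.append(frozenset(inside))
        if inside != set(x):    # a partially covered indexset ends the prefix
            break
    return tuple(m)

# Figure 3.2(e): nesting (kl, j) at f2 with f3.indices = {j, k, l}
assert max_fusion(('kl', 'j'), 'jkl') == (frozenset('kl'), frozenset('j'))
# nesting (kl, j) at f3 with f4.indices = {j, k} gives (k)
assert max_fusion(('kl', 'j'), 'jk') == (frozenset('k'),)
```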
Resulting nesting. Let u be the parent of a node v. If v is the only child of u,
then the loop nesting at u as a result of a fusion f between u and v can be obtained
by the function

ExtNesting(f, u) = f + (u.indices − Set(f))

For example, in Figure 3.2(e), if the fusion between f2 and f3 is (kl), then the nesting
at f3 would be (kl, j).
Compatible nestings. Suppose v has a sibling v', f is the fusion between u and
v, and f' is the fusion between u and v'. For the fusion graph for the subtree rooted
at u (which is merged from those of v and v') to be legal, h = ExtNesting(f, u) and
h' = ExtNesting(f', u) must be compatible according to the condition:

for all i ∈ h[r] and j ∈ h[s],
if r < s and i ∈ h'[r'] and j ∈ h'[s'], then r' ≤ s'.

This requirement ensures that an i-loop that has a larger scope than a j-loop in one
subtree will not have a smaller scope than the j-loop in the other subtree. If h and h'
are compatible, the resulting loop nesting at u (as merged from h and h') is h'' such that

for all i ∈ h''[r''] and j ∈ h''[s''],
if i ∈ h[r], i ∈ h'[r'], j ∈ h[s], and j ∈ h'[s'], then

1. r'' = s'' implies r = s and r' = s', and

2. r'' < s'' implies r ≤ s and r' ≤ s'.

Effectively, the loops at u are re-ranked by their combined scopes in the two subtrees
to form h''. As an example, in Figure 3.2(e), if the fusion between f1 and f4 is f = (j)
and the fusion between f3 and f4 is f' = (k), then h = ExtNesting(f, f4) = (j, k) and
h' = ExtNesting(f', f4) = (k, j) would be incompatible. But if f is changed to (),
then h = ExtNesting(f, f4) = (jk) would be compatible with h', and the resulting
nesting at f4 would be (k, j). A procedure for checking whether h and h' are compatible
and forming h'' from h and h' is provided in Section 3.3.
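The compatibility check and the merge can be done in a single left-to-right scan that repeatedly intersects the current indexsets of the two nestings. The following is our own Python sketch of that idea (the dissertation's own MergeNesting pseudocode appears in Figure 3.5); None signals incompatibility:

```python
# Our own sketch of checking compatibility of two nestings and forming
# the merged nesting h''. Nestings are tuples of index sets.

def merge_nesting(h1, h2):
    g = []
    i = j = 0
    x1, x2 = set(), set()
    while i < len(h1) or j < len(h2) or x1 or x2:
        if not x1 and i < len(h1):
            x1 = set(h1[i]); i += 1
        if not x2 and j < len(h2):
            x2 = set(h2[j]); j += 1
        y = x1 & x2
        if not y:
            return None                    # h1 and h2 are incompatible
        g.append(frozenset(y))             # loops with the same combined rank
        x1, x2 = x1 - x2, x2 - x1
    return tuple(g)

# The example from Figure 3.2(e):
assert merge_nesting(({'j'}, {'k'}), ({'k'}, {'j'})) is None
assert merge_nesting(({'j', 'k'},), ({'k'}, {'j'})) == \
       (frozenset({'k'}), frozenset({'j'}))
```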
The “more-constraining” relation on nestings. A nesting h at a node v is
said to be more or equally constraining than another nesting h' at the same node,
denoted h ⊑ h', if for every legal fusion graph G for T in which the nesting at v is h,
there exists a legal fusion graph G' for T in which the nesting at v is h' such that
the subgraphs of G and G' induced by T − subtree(v) are identical. In other words,
h ⊑ h' means that any loop fusion configuration for the rest of the expression tree
that works with h also works with h'. This relation allows us to do effective pruning
among the large number of loop fusion configurations for a subtree in Section 3.3. It
can be proved that the necessary and sufficient condition for h ⊑ h' is that

for all i ∈ m[r] and j ∈ m[s], there exist r', s' such that

1. i ∈ m'[r'] and j ∈ m'[s'],

2. r = s implies r' = s', and

3. r < s implies r' ≤ s'

where m = MaxFusion(h, v) and m' = MaxFusion(h', v). Comparing the nesting at
f3 between Figure 3.2(e) and (f), the nesting (kl, j) in (e) is more constraining than
the nesting (jkl) in (f). A procedure for determining whether h ⊑ h' is given in
Section 3.3.
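The condition above can be tested by consuming the indexsets of m' left to right while checking that each indexset of m fits inside the current one. This is our own sketch, assuming the test is applied to the maximal fusions m and m' as described above:

```python
# Our own sketch of the h ⊑ h' test, applied to m = MaxFusion(h, v)
# and m' = MaxFusion(h', v).

def more_constraining(m, m2):
    j = 0
    x2 = set()
    for x in m:
        if not x2:
            if j >= len(m2):
                return False
            x2 = set(m2[j]); j += 1
        if not set(x) <= x2:
            return False
        x2 -= set(x)
    return True

# At f3, (kl, j) has maximal fusion (k) and (jkl) has maximal fusion (jk):
assert more_constraining(({'k'},), ({'j', 'k'},))      # so (kl, j) ⊑ (jkl)
assert not more_constraining(({'j', 'k'},), ({'k'},))
```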
3.3 Static Memory Allocation

Under the static memory allocation model, since all the arrays in a program exist
during the entire computation, the memory usage of a loop fusion configuration is
simply the sum of the sizes of all the arrays (including those reduced to scalars).
Figures 3.3 to 3.6 show a bottom-up, dynamic programming algorithm for finding
a memory-optimal loop fusion configuration for a given expression tree T. For each
node v in T, it computes a set of solutions v.solns for the subtree rooted at v. Each
solution s in v.solns represents a loop fusion configuration for the subtree rooted at
v and contains the following information: the loop nesting s.nesting at v, the
fusion s.fusion between v and its parent, the memory usage s.mem so far, and the
pointers s.src1 and s.src2 to the corresponding solutions for the children of v.
The set of solutions v.solns is obtained by the following steps. First, if v is a leaf
node, initialize the solution set to contain a single solution using InitSolns. Otherwise,
take the solution set from a child v.child1 of v and, if v has two children, merge
it (using MergeSolns) with the compatible solutions from the other child v.child2.
Then, prune the solution set to remove the inferior solutions using PruneSolns. A
solution s is inferior to another unpruned solution s' if s.nesting is more or equally
constraining than s'.nesting and s does not use less memory than s'. Next, extend
the solution set by considering all possible legal fusions between v and its parent (see
ExtendSolns). The size of array v is added to the memory usage by AddMemUsage.
Inferior solutions are also removed.

Although the complexity of the algorithm is exponential in the number of index
variables and the number of solutions could in theory grow exponentially with the
size of the expression tree, the number of index variables in practical applications is
usually small and there is indication that the pruning is effective in keeping the size
of the solution set at each node small.

The algorithm assumes the leaf nodes may be freely fused with their parents and
the root node array must be available in its entirety at the end of the computation.
If these assumptions do not hold, the InitFusible procedure can easily be modified
to restrict or expand the allowable fusions for those nodes.
MinMemFusion (T):
    InitFusible (T)
    foreach node v in some bottom-up traversal of T
        if v.nchildren = 0 then
            S1 = InitSolns (v)
        else
            S1 = v.child1.solns
            if v.nchildren = 2 then
                S1 = MergeSolns (S1, v.child2.solns)
            S1 = PruneSolns (S1, v)
        v.solns = ExtendSolns (S1, v)
    T.root.optsoln = the single element in T.root.solns
    foreach node v in some top-down traversal of T
        s = v.optsoln
        v.optfusion = s.fusion
        s1 = s.src1
        if v.nchildren = 1 then
            v.child1.optsoln = s1
        else
            v.child1.optsoln = s1.src1
            v.child2.optsoln = s1.src2

InitFusible (T):
    foreach v ∈ T
        if v = T.root then
            v.fusible = ∅
        else
            v.fusible = v.indices ∩ v.parent.indices

InitSolns (v):
    s.nesting = (v.fusible)
    InitMemUsage (s)
    return {s}

Figure 3.3: Algorithm for static memory allocation.
MergeSolns (S1, S2):
    S = ∅
    foreach s1 ∈ S1
        foreach s2 ∈ S2
            s.nesting = MergeNesting (s1.nesting, s2.nesting)
            if s.nesting ≠ () then    // if s1 and s2 are compatible
                s.src1 = s1
                s.src2 = s2
                MergeMemUsage (s1, s2, s)
                AddSoln (s, S)
    return S

PruneSolns (S1, v):
    S = ∅
    foreach s1 ∈ S1
        s.nesting = MaxFusion (s1.nesting, v)
        AddSoln (s, S)
    return S

ExtendSolns (S1, v):
    S = ∅
    foreach s1 ∈ S1
        foreach prefix f of s1.nesting
            s.fusion = f
            s.nesting = ExtNesting (f, v.parent)
            s.src1 = s1
            size = FusedSize (v, f)
            AddMemUsage (v, f, size, s1, s)
            AddSoln (s, S)
    return S

AddSoln (s, S):
    foreach s' ∈ S
        if Inferior (s, s') then
            return
        else if Inferior (s', s) then
            S = S − {s'}
    S = S ∪ {s}

Figure 3.4: Algorithm for static memory allocation. (cont.)
MergeNesting (h, h'):
    g = ()
    r = r' = 1
    x = x' = ∅
    while r ≤ |h| or r' ≤ |h'|
        if x = ∅ then x = h[r++]
        if x' = ∅ then x' = h'[r'++]
        y = x ∩ x'
        if y = ∅ then
            return ()    // h and h' are incompatible
        g = g + y
        y = x − x'
        x' = x' − x
        x = y
    end while
    return g    // h and h' are compatible

h ⊑ h':    // test if h is more/equally constraining than h'
    r' = 1
    x' = ∅
    for r = 1 to |h|
        if x' = ∅ then
            if r' > |h'| then
                return false
            x' = h'[r'++]
        if h[r] ⊈ x' then
            return false
        x' = x' − h[r]
    return true

InitMemUsage (s):
    s.mem = 0

AddMemUsage (v, f, size, s1, s):
    s.mem = s1.mem + size

MergeMemUsage (s1, s2, s):
    s.mem = s1.mem + s2.mem

Figure 3.5: Algorithm for static memory allocation. (cont.)
Inferior (s, s') = s.nesting ⊑ s'.nesting and s.mem ≥ s'.mem

FusedSize (v, f) = ∏ of N_i over i ∈ v.indices − {v.sumindex} − Set (f)

ExtNesting (f, u) = f + (u.indices − Set (f))

MaxFusion (h, v) = max{f | prefix (f, h) and Set (f) ⊆ v.parent.indices}

Set (f) = ∪_{1≤r≤|f|} f[r]

Figure 3.6: Algorithm for static memory allocation. (cont.)
v    line  src    nesting   fusion    ext-nest   memory usage            opt
A     1           (ij)      (ij)      (ij)       1                        ✓
B     2           (jkl)     (jkl)     (jkl)      1                        ✓
C     3           (kl)      (kl)      (kl,j)     1
      4           (kl)      (k)       (k,jl)     15                       ✓
      5           (kl)      (l)       (l,jk)     40
      6           (kl)      ()        (jkl)      600
f1    7     1     (ij)      (j)       (j,k)      1 + 1 = 2
      8     1     (ij)      ()        (jk)       1 + 100 = 101            ✓
f2    9     2,3   (kl,j)    (kl,j)    (kl,j)     (1 + 1) + 1 = 3
      10    2,4   (k,jl)    (k,jl)    (k,jl)     (1 + 15) + 1 = 17        ✓
      11    2,5   (l,jk)    (l,jk)    (l,jk)     (1 + 40) + 1 = 42
      12    2,6   (jkl)     (jkl)     (jkl)      (1 + 600) + 1 = 602
f3    13    10    (k,jl)    (k,j)     (k,j)      17 + 1 = 18              ✓
      14    12    (jkl)     (jk)      (jk)       602 + 1 = 603
f4    15    7,14  (j,k)     (j,k)     (j,k)      (2 + 603) + 1 = 606
      16    8,13  (k,j)     (k,j)     (k,j)      (101 + 18) + 1 = 120     ✓
      17    8,14  (jk)      (jk)      (jk)       (101 + 603) + 1 = 705
f5    18    16    (k,j)     ()                   120 + 40 = 160           ✓

Table 3.1: Solution sets for the subtrees in the example.
To illustrate how the algorithm works, consider again the empty fusion graph in
Figure 3.2(d) for the expression tree in Figure 3.1(c). Let N_i = 500, N_j = 100,
N_k = 40, and N_l = 15. There are 2³ = 8 different fusions between B and f2. Among
them, only the full fusion (jkl) is in B.solns because all other fusions result in more
constraining nestings and use more memory than the full fusion, and so are pruned.
However, this does not happen to the fusions between C and f2, since the resulting
nesting (kl, j) of the full fusion (kl) is not less constraining than those of the other 3
possible fusions. Then, solutions from B and C are merged together to form solutions
for f2. For example, when the two full-fusion solutions from B and C are merged, the
merged nesting for f2 is (kl, j), which can then be extended by full fusion (between
f2 and f3) to form a full-fusion solution for f3 that has a memory usage of only 3
scalars. Again, since this solution is not the least constraining one, other solutions
cannot be pruned. Actually, although this solution is optimal for the subtree rooted
at f3, it turns out to be non-optimal for the entire expression tree. Table 3.1 shows
the solution sets for all of the nodes. The "src" column contains the line numbers
of the corresponding solutions for the children. The "ext-nest" column shows the
resulting nesting for the parent. A check mark (✓) indicates that the solution forms
a part of an optimal solution for the entire expression tree. The fusion graph for the
optimal solution is shown in Figure 3.7(a).
Once an optimal solution is obtained, we can generate the corresponding fused
loop structure from it. The following procedure determines an evaluation order of the
nodes:

1. Initialize set P to contain the single node T.root and list L to an empty list.
(a) [fusion graph for the optimal solution]

(b) initialize f1
    for i
      for j
        A = genA(i,j)
        f1[j] += A
    initialize f5
    for k
      for l
        C[l] = genC(k,l)
      for j
        initialize f3
        for l
          B = genB(j,k,l)
          f2 = B × C[l]
          f3 += f2
        f4 = f1[j] × f3
        f5[k] += f4

Figure 3.7: An optimal solution for the example.
2. While P is not empty, remove from P a node v whose v.optfusion is maximal
among all nodes in P, insert v at the beginning of L, and add the children of v
(if any) to P.

3. When P is empty, L contains the evaluation order.

Putting the loops around the array evaluation statements is trivial. The initialization
of an array can be placed inside the innermost loop that contains both the evaluation
and use of the array. The optimal loop fusion configuration for the example expression
tree is shown in Figure 3.7(b).
3.4 Memory-Optimal Evaluation Order of Unfused Expression Trees

We now address the problem of finding an evaluation order of the nodes in a given
unfused expression tree that minimizes memory usage under the dynamic memory
allocation model. In this section, we do not consider any loop fusions between the
nodes in an expression tree. The solution developed in this section will be applied to
fused expression trees in Section 3.5.

In this problem, the expression tree must be evaluated in some bottom-up order,
i.e., the evaluation of a node cannot precede the evaluation of any of its children.
Since the nodes in the expression tree are not fused, any bottom-up traversal is a
legal evaluation order. The nodes of the expression tree are large data objects whose
sizes are given. The total size of the data objects could be so large that they cannot
all fit into memory at the same time. Space for the data objects has to be allocated
and deallocated dynamically. Due to the parent-child dependence relation, a data
object cannot be deallocated until its parent node data object has been evaluated.
The objective is to minimize the maximum memory usage during the evaluation of
the entire expression tree. This problem could also arise in other applications such
as register allocation, database query optimization, and data mining.
As an example of the unfused memory usage optimization problem, consider the
expression tree shown in Figure 3.8. The size of each data object is shown alongside
the corresponding node label. Before evaluating a data object, space for it must be
allocated. That space can be deallocated only after the evaluation of its parent is
complete. There are many allowable evaluation orders of the nodes. One of them
is the post-order traversal (A, B, C, D, E, F, G, H, I) of the expression tree. It has a
                    I : 16
            F : 15          H : 5
        B : 3    E : 16         G : 25
      A : 20       D : 9
                 C : 30
Figure 3.8: An example unfused expression tree.
maximum memory usage of 45 units. This occurs during the evaluation of H, when
F, G, and H are in memory. Other evaluation orders may use more or less memory.
Finding the optimal order (C, D, G, H, A, B, E, F, I), which uses 39 units of
memory, is non-trivial.

Section 3.4.1 formally defines the unfused memory usage optimization problem
and makes some observations about it. Section 3.4.2 presents an efficient algorithm
that solves the problem in O(n²) time for an n-node expression tree. We show the
correctness of the algorithm in Section 3.4.3.
3.4.1 Problem Statement

The problem addressed is the optimization of memory usage in the evaluation of
a given expression tree, whose nodes correspond to large data objects of various sizes.
Each data object depends on all its children (if any), and thus can be evaluated only
after all its children have been evaluated. The goal is to find an evaluation order of
the nodes that uses the least amount of memory. Since an evaluation order is also a
traversal of the nodes, we will use these two terms interchangeably. Space for data
objects is dynamically allocated or deallocated under the following assumptions:

1. Each object is allocated or deallocated in its entirety.

2. Leaf node objects are created or read in as needed.

3. Internal node objects must be allocated before their evaluation begins.

4. Each object must remain in memory until the evaluation of its parent is completed.
We define the problem formally as follows:

Given a tree T and a size v.size for each node v ∈ T, find a computation
of T that uses the least memory, i.e., an ordering P = (v₁, ..., vₙ) of
the nodes in T, where n is the number of nodes in T, such that

1. for all vᵢ, vⱼ, if vᵢ is the parent of vⱼ, then i > j; and

2. max_{vᵢ∈P} himem(vᵢ, P) is minimized, where

himem(vᵢ, P) = lomem(vᵢ₋₁, P) + vᵢ.size
lomem(vᵢ, P) = himem(vᵢ, P) − Σ_{vⱼ : vⱼ is a child of vᵢ} vⱼ.size

with lomem(v₀, P) = 0. Here, himem(vᵢ, P) is the memory usage during the evaluation
of vᵢ in the traversal P, and lomem(vᵢ, P) is the memory usage upon completion of
the same evaluation. In general, before evaluating vᵢ, we need to allocate space for vᵢ.
After vᵢ is evaluated, the space allocated to all its children may be released. As an
illustration, consider the post-order traversal P = (A, B, C, D, E, F, G, H, I) of the expression
(a) Post-order traversal        (b) A better traversal         (c) The optimal traversal

Node  himem  lomem              Node  himem  lomem              Node  himem  lomem
A       20     20               G       25     25               C       30     30
B       23      3               H       30      5               D       39      9
C       33     33               C       35     35               G       34     34
D       42     12               D       44     14               H       39     14
E       28     19               E       30     21               A       34     34
F       34     15               A       41     41               B       37     17
G       40     40               B       44     24               E       33     24
H       45     20               F       39     20               F       39     20
I       36     16               I       36     16               I       36     16
max     45                      max     44                      max     39

Table 3.2: Memory usage of three different traversals of the expression tree in Figure 3.8.
tree shown in Figure 3.8. During and after the evaluation of A, A is in memory. So,
himem(A, P) = lomem(A, P) = A.size = 20. To evaluate B, we need to allocate
space for B, thus himem(B, P) = lomem(A, P) + B.size = 23. After B is obtained, A
can be deallocated, giving lomem(B, P) = himem(B, P) − A.size = 3. The memory
usage for the rest of the nodes can be determined in a similar fashion and is shown
in Table 3.2(a).
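The himem/lomem recurrences are easy to check mechanically. The following is our own sketch (not part of the dissertation), with the tree of Figure 3.8 transcribed into two dictionaries:

```python
# Our own check of the himem/lomem recurrences on the tree of Figure 3.8.

sizes = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
         'F': 15, 'G': 25, 'H': 5, 'I': 16}
children = {'B': ['A'], 'D': ['C'], 'E': ['D'], 'F': ['B', 'E'],
            'H': ['G'], 'I': ['F', 'H']}

def max_memory(order):
    lo = 0      # lomem of the previously evaluated node
    peak = 0
    for v in order:
        hi = lo + sizes[v]                                    # himem(v, P)
        lo = hi - sum(sizes[c] for c in children.get(v, []))  # lomem(v, P)
        peak = max(peak, hi)
    return peak

assert max_memory('ABCDEFGHI') == 45   # post-order, Table 3.2(a)
assert max_memory('GHCDEABFI') == 44   # Table 3.2(b)
assert max_memory('CDGHABEFI') == 39   # optimal, Table 3.2(c)
```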
However, the post-order traversal of the given expression tree is not optimal in
memory usage. For this example, none of the traversals that visit all nodes in one
subtree before visiting another subtree is optimal. For the given expression tree,
there are four such traversals. They are (A, B, C, D, E, F, G, H, I),
(C, D, E, A, B, F, G, H, I), (G, H, A, B, C, D, E, F, I), and (G, H, C, D, E, A, B, F, I).
If we follow the traditional wisdom of visiting the subtree that uses more memory first,
we obtain the best of the four traversals, which is (G, H, C, D, E, A, B, F, I). Its
overall memory usage is
44 units, as shown in Table 3.2(b), and is not optimal. The optimal traversal is
(C, D, G, H, A, B, E, F, I), which uses 39 units of memory (see Table 3.2(c)). Notice
that it 'jumps' back and forth between the subtrees. Therefore, any algorithm that
only considers traversals that visit subtrees one after another may not produce an
optimal solution.
One possible approach to the unfused memory usage optimization problem is to
apply dynamic programming on an expression tree as follows. Each traversal can
be viewed as going through a sequence of configurations, each configuration being
a set of nodes that have been evaluated (which can be represented more compactly
as a smaller set of nodes in which none is an ancestor or descendant of another).
In other words, the set of nodes in a prefix of a traversal forms a configuration.
Common configurations in different traversals form overlapping sub-problems. A
configuration can be formed in many ways, corresponding to different orderings of
the nodes. The optimal way to form a configuration Z containing k nodes can be
obtained by minimizing over all valid configurations that are (k−1)-subsets of Z. By
finding the optimal costs for all configurations in order of increasing number of
nodes, we get an optimal traversal of the expression tree. However, this approach is
inefficient in that the number of configurations is exponential in the number of nodes.

The unfused memory usage optimization problem has an interesting property: an
expression tree or a subtree may have more than one optimal traversal. For example,
for the subtree rooted at F, the traversals (C, D, E, A, B, F) and (C, D, A, B, E, F)
both use the least memory, 39 units. One might attempt to take two optimal
subtree traversals, one from each child of a node X, merge them together optimally,
and then append X to form a traversal for X. Although the optimal merge can
be evaluated in O(nm) time using dynamic programming (where n and m are the
lengths of the two subtree traversals), the resulting traversal may not be optimal for
X. Continuing with the above example, if we merge together (C, D, E, A, B, F) and
(G, H) (which are optimal for the subtrees rooted at F and H, respectively) and then
append I, the best we can get is a sub-optimal traversal (G, H, C, D, E, A, B, F, I)
that uses 44 units of memory (see Table 3.2(b)). However, the other optimal traversal
(C, D, A, B, E, F) for the subtree rooted at F can be merged with (G, H) to form
(C, D, G, H, A, B, E, F, I) (with I appended), which is an optimal traversal of the
entire expression tree. Hence, some optimal traversals of a subtree may not appear as
subsequences in any optimal traversal of the entire expression tree. In other words,
locally optimal traversals may not be globally optimal. In the next section, we present
an efficient algorithm that finds traversals which are not only locally optimal but also
globally optimal.
3.4.2 An Efficient Algorithm

We now present an efficient divide-and-conquer algorithm that, given an expression
tree whose nodes are large data objects, finds an evaluation order of the tree that
minimizes the memory usage. For each node in the expression tree, it computes an
optimal traversal for the subtree rooted at that node. The optimal subtree traversal
that it computes has a special property: it is not only locally optimal for the subtree,
but also globally optimal in the sense that it can be merged together with globally
optimal traversals for other subtrees to form a traversal for a larger subtree
which is also globally optimal. As we have seen in Section 3.4.1, not all locally
optimal traversals for a subtree can be used to form an optimal traversal for a larger
tree.
The algorithm stores a traversal not as an ordered list of nodes, but as an ordered
list of indivisible units called elements. Each element contains an ordered list of
nodes with the property that there necessarily exists some globally optimal traversal
of the entire tree wherein this sequence appears undivided. Therefore, as we show
later, inserting any node in between the nodes of an element does not lower the total
memory usage. An element initially contains a single node. But as the algorithm
goes up the tree merging traversals together and appending new nodes to them,
elements may be appended together to form new elements containing a larger number
of nodes. Moreover, the order of indivisible units in a traversal stays invariant, i.e.,
the indivisible units must appear in the same order in some optimal traversal of the
entire expression tree. This means that indivisible units can be treated as a whole and
we only need to consider the relative order of indivisible units from different subtrees.

Each element (or indivisible unit) in a traversal is a (nodelist, hi, lo) triple, where
nodelist is an ordered list of nodes, hi is the maximum memory usage during the
evaluation of the nodes in nodelist, and lo is the memory usage after those nodes are
evaluated. Using the terminology from Section 3.4.1, hi is the highest himem among
the nodes in nodelist, and lo is the lomem of the last node in nodelist. The algorithm
always maintains the elements of a traversal in decreasing hi and increasing lo order,
which implies an order of decreasing hi−lo difference. In Section 3.4.3, we prove that
arranging the indivisible units in this order minimizes memory usage.
MinMemTraversal (T):
    foreach node v in some bottom-up traversal of T
        v.seq = ()    // an empty list
        foreach child u of v
            v.seq = MergeSeq (v.seq, u.seq)
        if |v.seq| > 0 then    // |x| is the length of x
            base = v.seq[|v.seq|].lo
        else
            base = 0
        AppendSeq (v.seq, (v), v.size + base, v.size)
    nodelist = ()
    for i = 1 to |T.root.seq|
        nodelist = nodelist + T.root.seq[i].nodelist    // + is the concatenation operator
    return nodelist    // memory usage is T.root.seq[1].hi

MergeSeq (S1, S2):
    S = ()
    i = j = 1
    base1 = base2 = 0
    while i ≤ |S1| or j ≤ |S2|
        if j > |S2| or (i ≤ |S1| and S1[i].hi − S1[i].lo ≥ S2[j].hi − S2[j].lo) then
            AppendSeq (S, S1[i].nodelist, S1[i].hi + base1, S1[i].lo + base1)
            base2 = S1[i].lo
            i++
        else
            AppendSeq (S, S2[j].nodelist, S2[j].hi + base2, S2[j].lo + base2)
            base1 = S2[j].lo
            j++
    end while
    return S

AppendSeq (S, nodelist, hi, lo):
    E = (nodelist, hi, lo)    // new element to append to S
    i = |S|
    while i ≥ 1 and (E.hi ≥ S[i].hi or E.lo ≤ S[i].lo)
        E = (S[i].nodelist + E.nodelist, max(S[i].hi, E.hi), E.lo)    // S[i] is combined into E
        remove S[i] from S
        i−−
    end while
    S = S + E    // |S| is now i + 1

Figure 3.9: Procedure for finding a memory-optimal traversal of an expression tree.
Figure 3.9 shows the algorithm. The input to the algorithm (the MinMemTraversal
procedure) is an expression tree T, in which each node v has a field v.size
denoting the size of its data object. The procedure performs a bottom-up traversal
of the tree and, for each node v, computes an optimal traversal v.seq for the subtree
rooted at v. The optimal traversal v.seq is obtained by optimally merging together
the optimal traversals u.seq from each child u of v, and then appending v. At the end,
the procedure returns a concatenation of all the nodelists in T.root.seq as the optimal
traversal for the given expression tree. The memory usage of the optimal traversal is
T.root.seq[1].hi.

The MergeSeq procedure merges two given traversals S1 and S2 optimally and
returns the merged result S. S1 and S2 are subtree traversals of two children of
the same parent. The optimal merge is performed in a fashion similar to mergesort.
Elements from S1 and S2 are scanned sequentially and appended to S in
order of decreasing hi−lo difference. This order guarantees that the indivisible units
are arranged to minimize memory usage. Since S1 and S2 are formed independently,
the hi and lo values in the elements from S1 and S2 must be adjusted before they can
be appended to S. The amount of adjustment for an element from S1 (S2) equals the
lo value of the last merged element from S2 (S1), which is kept in base1 (base2).

The AppendSeq procedure appends a new element specified by nodelist, hi, and
lo to the given traversal S. Before the new element E is appended to S, it is combined
with elements at the end of S whose hi is not higher than E.hi or whose lo is not lower
than E.lo. The combined element has the concatenated nodelist and the highest hi
but the original E.lo. Elements are combined to form larger indivisible units.
Node v    Optimal traversal v.seq
A         ((A, 20, 20))
B         ((AB, 23, 3))
C         ((C, 30, 30))
D         ((CD, 39, 9))
E         ((CD, 39, 9), (E, 25, 16))
F         ((CD, 39, 9), (ABEF, 34, 15))
G         ((G, 25, 25))
H         ((GH, 30, 5))
I         ((CDGHABEFI, 39, 16))

(a) The expression tree in Figure 3.8.    (b) Optimal traversals for subtrees.

Figure 3.10: Optimal traversals for the subtrees in the expression tree in Figure 3.8.
To illustrate how the algorithm works, consider the expression tree shown in
Figure 3.8 and reproduced in Figure 3.10(a). We visit the nodes in a bottom-up
order. Since A has no children, A.seq = ((A, 20, 20)) (for clarity, we write nodelists
in a sequence as strings). To form B.seq, we take A.seq and append a new element
(B, 3 + 20, 3) to it. The AppendSeq procedure combines the two elements into one,
leaving B.seq = ((AB, 23, 3)). Here, A and B form an indivisible unit, implying that
B must follow A in some optimal traversal of the entire expression tree. Similarly, we
get E.seq = ((CD, 39, 9), (E, 25, 16)). For node F, which has two children B and E,
we merge B.seq and E.seq in order of decreasing hi−lo difference. So, the elements
merged are first (CD, 39, 9), then (AB, 23+9, 3+9), and finally (E, 25+3, 16+3), with
the adjustments shown. They are the three elements in F.seq after the merge, as no
elements are combined so far. Then, we append to F.seq a new element (F, 15+19, 15)
for the root of the subtree. The new element is combined with the last two elements
in F.seq. Hence, the final content of F.seq is ((CD, 39, 9), (ABEF, 34, 15)), which
consists of two indivisible units. The optimal traversals for the other nodes are
computed in the same way and are shown in Figure 3.10(b). At the end, the algorithm
returns the optimal traversal (C, D, G, H, A, B, E, F, I) for the entire expression tree
(see Table 3.2(c)).
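The whole procedure of Figure 3.9 is short enough to render as executable code. The following Python is our own sketch, not the dissertation's: elements are (nodelist, hi, lo) tuples, nodelists are strings, and the tie-breaking in the merge comparison is our choice.

```python
# Executable sketch (ours) of MinMemTraversal / MergeSeq / AppendSeq
# from Figure 3.9, applied to the tree of Figure 3.8.

def append_seq(S, nodelist, hi, lo):
    E = (nodelist, hi, lo)
    # combine trailing elements whose hi is not higher, or lo not lower, than E's
    while S and (E[1] >= S[-1][1] or E[2] <= S[-1][2]):
        last = S.pop()
        E = (last[0] + E[0], max(last[1], E[1]), E[2])
    S.append(E)

def merge_seq(S1, S2):
    S, i, j, base1, base2 = [], 0, 0, 0, 0
    while i < len(S1) or j < len(S2):
        take1 = j >= len(S2) or (i < len(S1) and
                 S1[i][1] - S1[i][2] >= S2[j][1] - S2[j][2])
        if take1:                              # larger hi-lo difference first
            append_seq(S, S1[i][0], S1[i][1] + base1, S1[i][2] + base1)
            base2, i = S1[i][2], i + 1
        else:
            append_seq(S, S2[j][0], S2[j][1] + base2, S2[j][2] + base2)
            base1, j = S2[j][2], j + 1
    return S

def min_mem_traversal(sizes, children, root):
    seq = {}
    def visit(v):
        s = []
        for u in children.get(v, []):
            visit(u)
            s = merge_seq(s, seq[u])
        base = s[-1][2] if s else 0
        append_seq(s, v, sizes[v] + base, sizes[v])
        seq[v] = s
    visit(root)
    order = ''.join(e[0] for e in seq[root])
    return order, seq[root][0][1]              # traversal and its memory usage

sizes = {'A': 20, 'B': 3, 'C': 30, 'D': 9, 'E': 16,
         'F': 15, 'G': 25, 'H': 5, 'I': 16}
children = {'B': ['A'], 'D': ['C'], 'E': ['D'], 'F': ['B', 'E'],
            'H': ['G'], 'I': ['F', 'H']}
assert min_mem_traversal(sizes, children, 'I') == ('CDGHABEFI', 39)
```

Running it reproduces the traversal and the 39-unit memory usage of Table 3.2(c), and the intermediate seq values match Figure 3.10(b).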
The time complexity of this algorithm is O(n²) for an n-node expression tree
because the processing for each node v takes O(m) time, where m is the number of
nodes in the subtree rooted at v. Although the AppendSeq procedure has a while-loop
in it, the total number of iterations of the loop during the processing of a node
cannot exceed m. On average, the performance of the algorithm could be close to
O(n) since the frequent combinations of the elements in a traversal keep the traversals
relatively short. Another feature of this algorithm is that the traversal it finds for a
subtree T' is not only optimal for T' but must also appear as a subsequence in some
optimal traversal for any larger tree that contains T' as a subtree. For example, E.seq
is a subsequence in F.seq, which is in turn a subsequence in I.seq (see Figure 3.10(b)).
3.4.3 Correctness of the Algorithm
We now show the correctness of the algorithm. The first lemma establishes some
important properties about the elements in an ordered list v.seq that represents a
traversal.
Lemma 1 Let v be any node in an expression tree, S = v.seq, and P be the traversal
represented by S of the subtree rooted at v, i.e., P = S[1].nodelist + ... + S[|S|].nodelist.
The algorithm maintains the following invariants:

For all 1 ≤ i ≤ |S|, let S[i].nodelist = (v_1, v_2, ..., v_n) and v_m be the
last node in S[i].nodelist that has the maximum himem value, i.e., for all
k < m, himem(v_k, P) ≤ himem(v_m, P) and for all k > m, himem(v_k, P) <
himem(v_m, P). Then, we have,

1. S[i].hi = himem(v_m, P),
2. S[i].lo = lomem(v_n, P),
3. for all m ≤ k ≤ n, lomem(v_k, P) ≥ lomem(v_n, P),
4. for all 1 ≤ j < i,
(a) for all 1 ≤ k ≤ n, S[j].hi ≥ himem(v_k, P),
(b) for all 1 ≤ k ≤ n, S[j].lo ≤ lomem(v_k, P),
(c) S[j].hi ≥ S[i].hi, and
(d) S[j].lo ≤ S[i].lo.
Proof
The above invariants are true by construction. ■
The second lemma asserts the ‘indivisibility’ of an indivisible unit by showing that
unrelated nodes inserted in between the nodes of an indivisible unit can always be
moved to the beginning or the end of the indivisible unit without increasing memory
usage. This lemma allows us to treat each traversal as a sequence of indivisible units
(each containing one or more nodes) instead of a list of the individual nodes.
Lemma 2 Let v be a node in an expression tree T, S = v.seq, and P be a traversal
of T in which the nodes from S[i].nodelist appear in the same order as they are in
S[i].nodelist, but not contiguously. Then, any nodes that are in between the nodes in
S[i].nodelist can always be moved to the beginning or the end of S[i].nodelist without
increasing memory usage, provided that none of the nodes that are in between the
nodes in S[i].nodelist are ancestors or descendants of any nodes in S[i].nodelist.
Proof
Let S[i].nodelist = (v_1, ..., v_n), v_0 be the node before v_1 in S, and v_m be the node
in S[i].nodelist such that for all k < m, himem(v_k, S) ≤ himem(v_m, S) and for
all k > m, himem(v_m, S) > himem(v_k, S). Let v'_1, ..., v'_b be the 'foreign' nodes,
i.e., the nodes that are in between the nodes in S[i].nodelist in P, with v'_1, ..., v'_a
(not necessarily contiguously) before and v'_{a+1}, ..., v'_b (not necessarily contigu-
ously) after v_m in P. Let Q be the traversal obtained from P by removing the
nodes in S[i].nodelist. We construct another traversal P' of T from P by moving
v'_1, ..., v'_a to the beginning of S[i].nodelist and v'_{a+1}, ..., v'_b to the end of S[i].nodelist.
In other words, we replace (v'_1, ..., v_1, ..., v'_a, ..., v_m, ..., v'_{a+1}, ..., v_n, ..., v'_b) in P
with (v'_1, ..., v'_a, v_1, ..., v_m, ..., v_n, v'_{a+1}, ..., v'_b) to form P'.
P and P' differ in memory usage only at the set of nodes {v_1, ..., v_n, v'_1, ..., v'_b}.
P' does not use more memory than P because:
1. The memory usage of v_m is the same in P and P' because himem(v_m, P) =
himem(v_m, P') = himem(v_m, S) + lomem(v'_a, Q).
2. For all 1 ≤ k ≤ n, since himem(v_k, S) ≤ himem(v_m, S), we have himem(v_k, P') =
himem(v_k, S) + lomem(v'_a, Q) ≤ himem(v_m, P) = himem(v_m, S) + lomem(v'_a, Q).
3. Since for all 1 ≤ k ≤ m, lomem(v_0, S) ≤ lomem(v_k, S) (by invariant 4(b)
in Lemma 1), we have, for all 1 ≤ j ≤ a, himem(v'_j, P') = himem(v'_j, Q) +
lomem(v_0, S) ≤ himem(v'_j, P).
4. Since for all m ≤ k ≤ n, lomem(v_k, S) ≥ lomem(v_n, S) (by invariant 3 in
Lemma 1), we have, for all a < j ≤ b, himem(v'_j, P') = himem(v'_j, Q) +
lomem(v_n, S) ≤ himem(v'_j, P).
Since the memory usage of any node in v_1, ..., v_n after moving the foreign nodes
cannot exceed that of v_m, which remains unchanged, and the memory usage of the
foreign nodes does not increase as a result of moving them, the overall maximum memory
usage cannot increase. ■
The next lemma deals with the ordering of indivisible units. It shows that
arranging indivisible units in the order of decreasing hi-lo difference minimizes memory
usage. This is because two indivisible units that are not in that order can be
interchanged in the merged traversal without increasing memory usage.
L em m a 3 Let v and v' be two nodes in an expression tree that are siblings of each
other, S = v.seq, and S' = v'.seq. Then, among all possible merges of S and S',
the merge that arranges the elements from S and S' in the order of decreasing hi-lo
difference uses the least memory.
Proof
Let M be a merge of S and S' that is not in the order of decreasing hi-lo difference.
Then there exists an adjacent pair of elements, one from each of S and S', that are
not in that order. Without loss of generality, we assume the first element is S'[j] from
S' and the second one is S[i] from S. Consider the merge M' obtained from M by
interchanging S'[j] and S[i]. To simplify the notation, let H_r = S[r].hi, L_r = S[r].lo,
H'_r = S'[r].hi, and L'_r = S'[r].lo. The memory usage of M and M' differs only at S'[j]
and S[i] and is compared in Figure 3.11.
element   himem             lomem
...
S'[j]     H'_j + L_{i-1}    L'_j + L_{i-1}
S[i]      H_i + L'_j        L_i + L'_j
...

(a) Sequence M

element   himem             lomem
...
S[i]      H_i + L'_{j-1}    L_i + L'_{j-1}
S'[j]     H'_j + L_i        L'_j + L_i
...

(b) Sequence M'

Figure 3.11: Memory usage comparison of two traversals in Lemma 3.
The memory usage of M at the two elements is max(H'_j + L_{i-1}, H_i + L'_j) while the
memory usage of M' at the same two elements is max(H_i + L'_{j-1}, H'_j + L_i). Since the two
elements are out of order, the hi-lo difference of S'[j] must be less than that of S[i],
i.e., H'_j - L'_j < H_i - L_i. This implies H_i + L'_j > H'_j + L_i. Invariant 4 in Lemma 1 gives
us L'_{j-1} ≤ L'_j, which implies H_i + L'_j ≥ H_i + L'_{j-1}. Thus, max(H'_j + L_{i-1}, H_i + L'_j) ≥
max(H'_j + L_i, H_i + L'_{j-1}). Therefore, M' cannot use more memory than M. By
switching all adjacent pairs in M that are out of order until no such pair exists, we
get an optimal order without increasing memory usage. ■
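Lemma 3 can also be sanity-checked by brute force on a small example (a toy model for illustration, not from the dissertation): enumerate every interleaving of two seqs that preserves each seq's internal order, score each by its peak memory, and compare against the merge in decreasing hi-lo order.

```python
# Toy check of Lemma 3: the base of an element is the lo retained by the
# most recently placed element of the other seq.
from itertools import combinations

def peak(order):
    base = {'S': 0, 'T': 0}     # base[x]: lo the *other* seq has retained
    hi_water = 0
    for who, hi, lo in order:
        other = 'T' if who == 'S' else 'S'
        hi_water = max(hi_water, hi + base[who])
        base[other] = lo        # later elements of the other seq sit on lo
    return hi_water

def interleavings(s, t):
    """All orders of s + t that keep each seq's internal order."""
    for pos in combinations(range(len(s) + len(t)), len(s)):
        it_s, it_t = iter(s), iter(t)
        yield [next(it_s) if k in pos else next(it_t)
               for k in range(len(s) + len(t))]

# Elements satisfy Lemma 1's invariants: hi non-increasing, lo non-decreasing.
S = [('S', 39, 9), ('S', 25, 16)]
T = [('T', 23, 3)]
greedy = sorted(S + T, key=lambda e: e[2] - e[1])  # decreasing hi - lo
assert peak(greedy) == min(peak(o) for o in interleavings(S, T))
```

On these (hypothetical) numbers both the greedy merge and the brute-force minimum give a peak of 39; the stable sort preserves each seq's internal order, as Lemma 3 requires.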
Theorem 4 Given an expression tree, the algorithm presented in Section 3.4.2
computes a traversal that uses the least memory.
Proof
We prove the correctness of the algorithm by describing a procedure that
transforms any given traversal to the traversal found by the algorithm without increase in
memory usage in any transformation step. Given a traversal P for an expression tree
T, we visit the nodes in T in a bottom-up manner and, for each non-leaf node v in
T, we perform the following steps:
1. Let T' be the subtree rooted at v and P' be the minimal substring of P that
contains all the nodes from T' - {v}. In the following steps, we will rearrange
the nodes in P' so that the nodes that form an indivisible unit in v.seq are
contiguous and the indivisible units are in the same order as they are in v.seq.
2. First, we sort the components of the indivisible units in v.seq so that they are in
the same order as in v.seq. The sorting process involves rearranging two kinds
of units. The first kind of units are the indivisible units in u.seq for each child
u of v. The second kind of units are the contiguous sequences of nodes in P'
which are from T - T'. For this sorting step, we temporarily treat each such
maximal contiguous sequence of nodes as a unit. For each unit E of the second
kind, we take E.hi = max_{w ∈ E} himem(w, P) and E.lo = lomem(w_n, P) where w_n
is the last node in E. The sorting process is as follows.
While there exist two adjacent units E' and E in P' such that E' is
before E and E'.hi - E'.lo < E.hi - E.lo,
(a) Swap E' and E. By Lemma 3, this does not increase the memory
usage.
(b) If two units of the second kind become adjacent to each other
as a result of the swapping, combine the two units into one and
recompute its new hi and lo.
When the above sorting process finishes, all units of the first kind, which are
components of the indivisible units in v.seq, are in the order of decreasing hi-lo
difference. Since, for each child u of v, indivisible units in u.seq have been in the
correct order before the sorting process, their relative order is not changed. The
order of the nodes from T - T' is preserved because the sorting process never
swaps two units of the second kind. Also, v and its ancestors do not appear
in P', and nodes in units of the first kind are not ancestors or descendants of
any nodes in units of the second kind. Therefore, the sorting process does not
violate parent-child dependences.
3. Now that the components of the indivisible units in v.seq are in the correct
order, we make the indivisible units contiguous using the following combining
process.
For each indivisible unit E in v.seq,
(a) In the traversal P, if there are nodes from T - T' in between the
nodes from E, move them either to the beginning or the end of
E as specified by Lemma 2.
(b) Make the contiguous sequence of nodes from E an indivisible
unit.
Upon completion, each indivisible unit in v.seq is contiguous in P and the order
in P of the indivisible units is the same as they are in v.seq. According to
Lemma 2, moving 'foreign' nodes out of an indivisible unit does not increase
the memory usage. Also, the order of the nodes from T - T' is preserved. Hence,
the combining process does not violate parent-child dependences.
We use induction to show that the above procedure correctly transforms any
given traversal P into an optimal traversal found by the algorithm. The induction
hypothesis H(u) for each node u is that:
• the nodes in each indivisible unit in u.seq appear contiguously in P and are in
the same order as they are in u.seq, and
• the order in P of the indivisible units in u.seq is the same as they are in u.seq.
Initially, H(u) is true for every leaf node u because there is only one traversal order
for a leaf node. As the induction step, assume H(u) is true for each child u of a
node v. The procedure rearranges the nodes in P' so that the nodes that form an
indivisible unit in v.seq are contiguous in P, the sets of nodes corresponding to the
indivisible units are in the same order in P as they are in v.seq, and the order among
the nodes that are not in the subtree rooted at v is preserved. Thus, when the
procedure finishes processing a node v, H(v) becomes true. By induction, H(T.root)
is true and a traversal found by the algorithm is obtained. Since any traversal P can
be transformed into a traversal found by the algorithm without increasing memory
usage in any transformation step, no traversal can use less memory and the algorithm
is correct. ■
3.5 Dynamic Memory Allocation
Under dynamic memory allocation, space for arrays can be allocated and deallocated
as needed. We allow an array to be allocated/deallocated multiple times by
putting the allocate/deallocate statements inside some loops. But each array must
be allocated/deallocated in its entirety. We do not consider sharing of space between
arrays. Given a loop fusion configuration, the positions of the allocate/deallocate
statements are determined as follows. Let v be an array, s_e be the statement that
evaluates v, s_u be the statement that uses v, and the t-loop be the innermost loop
that contains both s_e and s_u. The latest point an "allocate v" statement can be
placed is inside the t-loop and before s_e and any loops that surround s_e and are
inside the t-loop. The earliest point a "deallocate v" statement can be placed is inside
the t-loop and after s_u and any loops that surround s_u and are inside the t-loop. The
memory usage of a loop fusion configuration is no longer the sum of the array sizes,
but the maximum total size of the arrays that are allocated at any time.
Another difference between dynamic allocation and static allocation is that, with
dynamic allocation, the evaluation order of the nodes having equal fusion with their
parents can affect memory usage although the sizes of individual arrays remain
unchanged. Consider the fusion graph shown in Figure 3.12(a). Since f1 and f3 are
equally fused with f4, either of the two subtrees rooted at f1 and f3 can be evaluated
first. Figure 3.12(b) and (c) show the loop fusion configurations for the two evaluation
orders. Assume the same index ranges N_i = 500, N_j = 100, N_k = 40, and N_l = 15
as before. In (b), the maximum memory usage is 4141 elements when f1, f3, f4, and
f5 are allocated during the evaluation of f4 and f5. But in (c), when f1 is being
evaluated, A, f1, and f3 coexist in memory and their total size is 4600 elements.
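The accounting rule, maximum total size of simultaneously allocated arrays, can be sketched as a scan over a flattened allocate/free event trace (a simplification: the real configurations allocate and free inside loops repeatedly; the sizes and event orders below follow the Figure 3.12 example, with f2 and f4 as scalars):

```python
# Sketch: under dynamic allocation, memory usage is the maximum total size
# of simultaneously live arrays, not the sum of all array sizes.
def peak_memory(trace, sizes):
    """trace: sequence of ('alloc'|'free', name) events."""
    live = peak = 0
    for op, name in trace:
        live += sizes[name] if op == 'alloc' else -sizes[name]
        peak = max(peak, live)
    return peak

# Fused array sizes from the example: Ni=500, Nj=100, Nk=40, Nl=15.
sizes = {'A': 500, 'B': 15, 'C': 15, 'f1': 100, 'f2': 1,
         'f3': 4000, 'f4': 1, 'f5': 40}

def run(order):
    """Flattened event order: evaluate f1's and f3's subtrees in the given
    order, then f4/f5 (one alloc/free pair per array)."""
    ev = {'f1': [('alloc', 'f1'), ('alloc', 'A'), ('free', 'A')],
          'f3': [('alloc', 'f3'), ('alloc', 'C'), ('alloc', 'B'),
                 ('alloc', 'f2'), ('free', 'f2'), ('free', 'B'),
                 ('free', 'C')]}
    tail = [('alloc', 'f5'), ('alloc', 'f4'), ('free', 'f4'),
            ('free', 'f5'), ('free', 'f1'), ('free', 'f3')]
    return ev[order[0]] + ev[order[1]] + tail

print(peak_memory(run(['f1', 'f3']), sizes))  # 4141, as in Figure 3.12(b)
print(peak_memory(run(['f3', 'f1']), sizes))  # 4600, as in Figure 3.12(c)
```

Evaluating f3's subtree first leaves its 4000-element array live while A (500 elements) is allocated, which is exactly why the evaluation order matters under dynamic allocation.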
Therefore, in addition to finding the fusions between children and parents, we
need to determine the evaluation order of the nodes that minimizes memory usage.
A simpler problem of finding the memory-optimal evaluation order of the nodes in an
expression tree without any fusion has been addressed in Section 3.4. Here, we apply
its result to each set of nodes having equal fusion to obtain the optimal evaluation
order among them.
We use the same bottom-up, dynamic programming algorithm as in Figures 3.3 to
3.6 to traverse the expression tree and enumerate the legal fusions for the nodes. But
the procedures related to calculations of memory usage (namely, InitMemUsage,
[Figure 3.12 shows the fusion graph (a), in which f1 and f3 are equally fused with their parent f4, together with the loop fusion configurations for the two evaluation orders: (b) evaluates the subtree rooted at f1 first, and (c) evaluates the subtree rooted at f3 first.]

Figure 3.12: A fusion graph with equal fusions and its two evaluation orders.
InitMemUsage (s):
  s.seqset = ∅

AddMemUsage (v, f, size, s1, s):
  s.seqset = s1.seqset
  x = Set(f)
  foreach x' ∈ s.seqset s.t. x ⊆ x', in decreasing |x'|
    CollapseSeq (s.seqset, x', x)
  if x ∉ s.seqset then
    x.seq = ()
    s.seqset = s.seqset ∪ {x}
    base = 0
  else
    base = x.seq[|x.seq|].lo
  AppendSeq (x.seq, (v), size + base, size)

MergeMemUsage (s1, s2, s):
  s.seqset = s1.seqset
  foreach x' ∈ s2.seqset
    if ∃x ∈ s.seqset s.t. x = x' then
      MergeSeq (x.seq, x'.seq)
    else
      s.seqset = s.seqset ∪ {x'}

CollapseSeq (QQ, x', x):
  if ∃y ∈ QQ s.t. x ⊆ y ⊂ x' then
    y = NextSmaller (QQ, x')
    base = y.seq[|y.seq|].lo
  else
    y = x
    y.seq = ()
    base = 0
  for i = 1 to |x'.seq|
    AppendSeq (y.seq, x'.seq[i].nodelist, x'.seq[i].hi + base, x'.seq[i].lo + base)

Figure 3.13: Algorithm for dynamic memory allocation.
MergeSeq (Q1, Q2):
  Q = ()
  i = j = 1
  base1 = base2 = 0
  while i ≤ |Q1| or j ≤ |Q2|
    if j > |Q2| or (i ≤ |Q1| and Q1[i].hi - Q1[i].lo > Q2[j].hi - Q2[j].lo) then
      AppendSeq (Q, Q1[i].nodelist, Q1[i].hi + base1, Q1[i].lo + base1)
      base2 = Q1[i].lo
      i++
    else
      AppendSeq (Q, Q2[j].nodelist, Q2[j].hi + base2, Q2[j].lo + base2)
      base1 = Q2[j].lo
      j++
  end while
  return Q

AppendSeq (Q, nodelist, hi, lo):
  E = (nodelist, hi, lo)  // new element to append to Q
  i = |Q|
  while i ≥ 1 and (E.hi ≥ Q[i].hi or E.lo ≤ Q[i].lo)
    E = (Q[i].nodelist + E.nodelist, max(Q[i].hi, E.hi), E.lo)
    remove Q[i] from Q  // Q[i] is combined into E
    i--
  end while
  Q = Q + E  // |Q| is now i + 1

Inferior (s, s') = s.nesting ⊆ s'.nesting and InferiorSeqSet (s.seqset, s'.seqset)

InferiorSeqSet (QQ, QQ') =
  ∀y, High (QQ, y) ≥ High (QQ', y) and Low (QQ, y) ≥ Low (QQ', y)

High (QQ, y) = x.seq[1].hi + Low (QQ, x')
  where x = NextSmaller (QQ, y) and x' = NextSmaller (QQ, x)

Low (QQ, y) = Σ_{x ∈ QQ and x ⊆ y} x.seq[|x.seq|].lo

NextSmaller (QQ, y) = max{x ∈ QQ | x ⊂ y}

Figure 3.14: Algorithm for dynamic memory allocation (cont.)
AddMemUsage, MergeMemUsage, and Inferior) are replaced by those in
Figures 3.13 and 3.14.
To correctly calculate memory usage and to determine the optimal evaluation
order of nodes with equal fusion, instead of a mem field, each solution for a node
v now has a seqset field, which is a set of fusion indexsets Set(f) for the fusions f
in the subtree rooted at v. Each fusion indexset x in s.seqset is associated with an
x.seq, which is a sequence of indivisible units. Each indivisible unit is a (nodelist,
hi, lo) triple, where nodelist is an ordered list of nodes, hi is the maximum memory
usage during the evaluation of the nodes in nodelist, and lo is the memory usage after
those nodes are evaluated. The nodes in a nodelist have the special property that
there necessarily exists some globally optimal traversal of the entire tree wherein this
sequence appears undivided. Therefore, inserting any node in between the nodes of
an indivisible unit does not lower the total memory usage.
The seqsets and seqs are manipulated as follows. When two solutions from two
children nodes of the same parent are merged together, their seqsets are merged by
the MergeMemUsage procedure. A seq for a fusion indexset that appears in only
one seqset is simply copied to the result seqset. If a fusion indexset x appears in both
seqsets, the indivisible units in the two seqs for x are "merge-sorted" together in the
order of decreasing hi-lo difference (by the MergeSeq procedure) to form a combined
seq and their hi-lo values are adjusted. In Section 3.4, we have proved that arranging
the indivisible units in this order minimizes memory usage. Moreover, the indivisible
units in a seq must appear in the same order in some optimal traversal of the entire
expression tree.
v     v.fusion    x          x.seq
A     ⟨j⟩         {j}        (((A), 500, 500))
B     ⟨j,k⟩       {j,k}      (((B), 15, 15))
C     ⟨k⟩         {k}        (((C), 15, 15))
f1    ⟨⟩          ∅          (((A, f1), 600, 100))
f2    ⟨k,j,l⟩     {j,k,l}    (((f2), 1, 1))
                  {j,k}      (((B), 15, 15))
                  {k}        (((C), 15, 15))
f3    ⟨⟩          ∅          (((C, B, f2, f3), 4031, 4000))
f4    ⟨j,k⟩       ∅          (((A, f1, C, B, f2, f3), 4131, 4100))
                  {j,k}      (((f4), 1, 1))
f5    ⟨⟩          ∅          (((A, f1, C, B, f2, f3, f4, f5), 4141, 40))

Table 3.3: The seqsets for the fusion graph in Figure 3.12(a).
When a solution is extended to include a new node v having fusion f with its
parent, the seqs in v.seqset for fusion indexsets larger than Set(f) are collapsed into
the seq for Set(f) (by the CollapseSeq procedure). In collapsing a seq Q into another
seq Q', all the indivisible units in Q are appended to Q' after their hi-lo values are
adjusted. When all collapsings are done, all fusion indexsets in v.seqset are subsets
of Set(f). Then, a new indivisible unit for v is appended to the seq for Set(f).
Whenever an indivisible unit E is appended to a seq Q by the AppendSeq
procedure, it is combined with indivisible units at the end of Q whose hi is not higher
than E.hi or whose lo is not lower than E.lo. The combined indivisible unit has the
concatenated nodelist and the highest hi but the original E.lo. An indivisible unit
initially contains a single node. But as the algorithm traverses up the tree, indivisible
units may be combined to form ones that contain more nodes.
As the procedures for enumerating legal fusions are the same for both static and
dynamic memory allocation, we will use a particular fusion graph, the one in
Figure 3.12(a), to illustrate how memory usage is maintained by the algorithm under
dynamic allocation. For node A, the fusion is ⟨j⟩ and the fused array size is N_i = 500.
So, A.seqset has a single fusion indexset {j} whose associated seq is (((A), 500, 500)).
The seqsets for B and C are similarly obtained. Since f1 is not fused with f4, the
AddMemUsage procedure collapses the only seq in A.seqset from fusion indexset
{j} to ∅. When a new indivisible unit ((f1), 600, 100) for f1 is appended to the seq for
∅, it is combined with the existing indivisible unit into ((A, f1), 600, 100). Now, A and
f1 become indivisible. For node f2, the seqsets for B and C (which share no common
fusion indexset) are first merged together by MergeMemUsage and then a new seq
for {j,k,l} = Set(⟨k,j,l⟩) is created for f2. Thus, f2.seqset has 3 seqs. To form the
seqset for f3, which is not fused with f4, the AddMemUsage procedure first
collapses the 3 seqs in f2.seqset into (((C, B, f2), 31, 31)), and then appends to it a new
indivisible unit ((f3), 4031, 4000) to form the single seq (((C, B, f2, f3), 4031, 4000))
in f3.seqset. The seqsets for all the nodes are shown in Table 3.3. The single seq
in the root node's seqset contains the optimal evaluation order of the nodes and the
hi value of the first indivisible unit in that seq is the maximum memory usage (see
Figure 3.12(b)).
3.6 Common Sub-Expressions
A formula sequence for a multi-dimensional summation with common sub-expressions
may need a directed acyclic graph (DAG) representation instead of an expression tree.
The construction of a potential fusion graph G for a DAG, which differs from
that for a tree described in Section 3.1, is as follows.
1. For each node u in the DAG, create in G a set of vertices, one for each dimension
of array u. If u is a summation node, add a vertex for the summation index.
2. In each formula in the given formula sequence, if dimension d or the summation
index of the result array u shares the same index variable with dimension d'
of an operand array u', then connect in G the vertex for dimension d or the
summation index of node u and the vertex for dimension d' of node u' with a
potential fusion edge.
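The two construction steps above can be sketched as follows (the formula encoding and the function name are hypothetical, not from the dissertation, and each formula is assumed to sum over at most one index; a vertex is a (node, dimension) pair, with dimension 0 standing for the summation index):

```python
# Sketch of potential-fusion-graph construction for a DAG.
def potential_fusion_graph(formulas):
    """formulas: list of (result, result_indices, operands), with operands
    a list of (name, indices); shared index variables link dimensions."""
    vertices, edges = set(), set()
    for result, r_idx, operands in formulas:
        for d in range(1, len(r_idx) + 1):          # step 1: result's dims
            vertices.add((result, d))
        sum_vars = ({v for _, o_idx in operands for v in o_idx}
                    - set(r_idx))                   # summation index, if any
        if sum_vars:
            vertices.add((result, 0))
        for name, o_idx in operands:                # step 2: fusion edges
            for d2, var in enumerate(o_idx, start=1):
                vertices.add((name, d2))
                if var in r_idx:
                    edges.add(((result, r_idx.index(var) + 1), (name, d2)))
                elif var in sum_vars:
                    edges.add(((result, 0), (name, d2)))
    return vertices, edges

# f1[j] = sum_i A[i,j] * B[i,j]  (a hypothetical one-formula sequence)
V, E = potential_fusion_graph([('f1', ('j',), [('A', ('i', 'j')),
                                               ('B', ('i', 'j'))])])
assert (('f1', 0), ('A', 1)) in E and (('f1', 1), ('A', 2)) in E
```

In the tiny example, dimension 1 of A (index i) connects to f1's summation vertex, and dimension 2 of A (index j) connects to dimension 1 of f1, mirroring step 2.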
As an example, consider the multi-dimensional summation in Figure 3.15(a), in
which arrays A and B appear twice. A formula sequence for computing the
summation and its DAG representation are shown in Figure 3.15(b) and (c), respectively.
Figure 3.15(d) shows a potential fusion graph constructed as explained above for the
DAG.
Notice that, unlike a fusion graph for a tree, the vertices in a (potential) fusion
graph for a DAG are labeled not by index variable names but by dimension numbers.
This is because multiple references to an array may have different index variables,
which makes index renaming possible and precludes a unique mapping between array
dimensions and index variables.
A fusion graph is a potential fusion graph with a subset of the potential fusion
edges turned into fusion edges. However, some fusion graphs are illegal, i.e., they do
not correspond to loop fusion configurations that correctly compute the final result.
A fusion graph G for a DAG is legal if it satisfies the following requirements.
[Figure 3.15: (a) a multi-dimensional summation in which arrays A and B appear twice; (b) a formula sequence computing it; (c) a DAG representation for (b); (d) a potential fusion graph for (c).]

Figure 3.15: An example multi-dimensional summation with common sub-expressions and representations of a computation.
[Figure 3.16 shows five fusion graphs, (a) through (e), each of which violates one of the legality requirements discussed below.]

Figure 3.16: Examples of illegal fusion graphs for a DAG.
1. The scopes of any two fusion chains have to be disjoint or a subset/superset
of each other. This was the only requirement on a legal fusion graph for a
tree (see Section 3.1). For a DAG, it also means that the scopes of fusion
chains along different paths from a multi-parent node v to DOM(v) must not
partially overlap. Figure 3.16(a) shows an example fusion graph that violates
this requirement. It is illegal because the two loops for the two fusion chains
are nested but neither of them can be the outer loop.
2. A multi-parent node v must have the same fusion with all its parents. In other
words, each dimension of v must be fused with either all the parents of v or
none of them. This requirement ensures v is computed only once and has only
one size. For example, in Figure 3.16(b), the second dimension of f1 is fused
with one of its parents but not the other. So, the fusion graph is illegal.
3. If a multi-parent node v and an ancestor v' of v are in the scope of the same
fusion chain c, then any node u that is an ancestor of v and a descendant of v'
must also be in the scope of c. If u were not in the scope of c, u would be outside the
loop that corresponds to c and contains both v and v'. But the evaluation of u
can neither be before nor after the loop because u depends on v and v' depends
on u. This requirement holds even if no potential fusion edge(s) exist to extend
the scope of c to include u. The example fusion graph in Figure 3.16(c) does
not satisfy this requirement since several nodes are in the scope of a fusion
chain, but an intermediate node between them is not.
[Figure 3.17 shows two legal fusion graphs for the DAG of Figure 3.15(c) and the loop fusion configurations that correspond to them.]

Figure 3.17: Examples of legal fusion graphs and corresponding loop fusion configurations for a DAG.
4. A fusion chain must not connect two different vertices of the same node. For
instance, the fusion chain in Figure 3.16(d) connects both vertices of f1. Hence,
the fusion graph is illegal.
5. If a node u and its parent u' are in the scope of the same fusion chain c, then
there must be either a fusion edge in c between u and u', or a potential fusion
edge that connects the two vertices in u and u' that c connects. Thus, the
fusion graph in Figure 3.16(e) is illegal since f4 and f5 are in the scope of a
fusion chain, but there is no fusion edge or potential fusion edge that connects
the summation vertex of f4 and the only vertex of f5.
Figure 3.17 shows two legal fusion graphs and their corresponding loop fusion
configurations.
A is dense. Sparsity entry in B and f1: (i, j, 0.1)

(a) (b)

for i
  for j [ A[j]=genA(i,j)
  for j∈nonzeros1(i)
    [ B=genB(i,j)
      f1[i,j]=A[j]×B

(c)

Figure 3.18: An example of legal loop fusions for sparse arrays.
3.7 Sparse Arrays
As discussed in Section 2.7, sparsity in an input or intermediate array can be
represented as a list of sparsity entries. Each sparsity entry consists of the two
dimensions of the array involving the sparsity and a sparsity factor, which equals
the fraction of non-zero array elements. Since we assumed earlier, when counting
arithmetic operations for sparse arrays, that only non-zero elements in result arrays
are formed, the loops corresponding to sparse dimensions of a result array should
iterate over non-zero elements only. For a pair of loops to be fusible among a set of
nodes, the two loops must have compatible loop ranges, i.e., they have to iterate over
the same number of non-zero elements in each of those nodes. This leads to a new
requirement on a legal fusion graph: when a pair of loops are fused among a set of
nodes, the sparsity factor between the two fused dimensions has to be the same for
all those nodes.
For instance, in the formula f1[i,j] = A[i,j] × B[i,j], suppose that A is dense and
B, and hence f1, have a single sparsity entry of (i, j, 0.1). We can fuse both the i- and
j-loops between B and f1 and one of the two loops between A and f1. Figure 3.18
shows how the fusion graph and the corresponding loop fusion configuration would
(a) (b)
Figure 3.19: Illegal fusion graphs due to representations of sparse arrays.
look like, in which nonzeros1(i) is the set of j values such that B[i,j] is non-zero.
However, fusing both loops between A and f1 is illegal because they have different
sparsity factors (and we assume the entire array A needs to be formed for other
purposes).
Internal representations of sparse arrays could impose another constraint on legal
fusion graphs. If the representation allows sparse arrays to be accessed in any order
of the array dimensions efficiently, loops around the production or the consumption
of sparse arrays can be permuted freely to facilitate loop fusions. Otherwise, only a
single or a limited set of loop permutations around sparse arrays would be permitted,
thus restricting how loops can be fused. For example, if a two-dimensional sparse
array C must be accessed in row-major order, then the fusion graph in Figure 3.19(a)
would become illegal because it makes j, the second dimension of C, the outer loop.
Also, if a sparse array f3 can be accessed in only one order, the fusion graph in
Figure 3.19(b) would again be illegal since f3 is produced row-wise but consumed
column-wise.
Sparsity entries in Y: (i, j, s1), (j, k, s2), (k, l, s3)

(a)

for i
  for j∈nonzeros1(i)
    for k∈nonzeros2(j)
      for l∈nonzeros3(k)
        [ Y[k,l]=genY(i,j,k,l)
    for k∈nonzeros2(j)
      for l∈nonzeros3(k)
        [ f6[i,j,k,l]+=Y[k,l]

(b) (c)

Fused array size of Y = N_k × N_l × s2 × s3

(d)
Figure 3.20: Fused size of sparse arrays.
Fusing a loop, say a t-loop, between a node v for a sparse array and its parent
eliminates the t-dimension in the sparse array v, but does not always reduce the
array size (i.e., the number of non-zero elements²) by a factor of N_t. As an example,
consider the formula f6[i,j,k,l] = Y[i,j,k,l], in which array Y has sparsity entries
(i, j, s1), (j, k, s2), and (k, l, s3). If we choose to fuse only the i- and j-loops between
Y and f6, the sparsity graph and the loop fusion configuration would be as shown in
Figure 3.20. Here, the fused size (i.e., the size after loop fusion) of Y is the number of
non-zero elements in Y when i and j are fixed, i.e., the number of non-zero elements
in the slice Y[i, j, *, *]. In general, the fused size of a sparse array v equals the number
of non-zero elements in v when all the fused dimensions have fixed values. This fused
size can be computed as the product of the ranges of all unfused dimensions of v
times the product of the sparsity factors in the sparsity entries of v that involve at
least one unfused dimension. Formally, it can be expressed as:

(∏_{i ∈ v.dimens − Set(f)} N_i) × (∏_{e ∈ L} e.sparsityfactor)
²Depending on the internal representation of sparse arrays, the actual memory usage could be higher.
[Figure 3.21(a) shows an FFT node with input X[K,i] and the exponential array exp[i,j]; (b) shows the corresponding fusion graph when the loops in K are fused.]

for K
  for i [ X[i] = ...
  fr[1:Nj] = fft(X[1:Ni])
  for j [ ... = fr[j] ...

(c)
Figure 3.21: Loop fusions for an FFT node.
where f is the fusion between v and its parent, and

L = {e ∈ v.sparsityentries | e.dim1 ∉ Set(f) or e.dim2 ∉ Set(f)}

In our example, the fused size of Y is N_k × N_l × s2 × s3.
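The formula can be sketched directly (the helper name and the numeric sparsity factors s1 = 0.1, s2 = 0.2, s3 = 0.3 are hypothetical; the dimensions and sparsity entries are those of Y in the example):

```python
# Sketch of the fused-size formula for a sparse array.
def fused_size(dims, ranges, entries, fused):
    """dims: dimension names of v; ranges: dim -> N; entries: list of
    (dim1, dim2, factor) sparsity entries; fused: Set(f)."""
    size = 1.0
    for d in dims:
        if d not in fused:
            size *= ranges[d]          # ranges of all unfused dimensions
    for d1, d2, s in entries:
        if d1 not in fused or d2 not in fused:
            size *= s                  # entries touching an unfused dimension
    return size

ranges = {'i': 500, 'j': 100, 'k': 40, 'l': 15}
entries = [('i', 'j', 0.1), ('j', 'k', 0.2), ('k', 'l', 0.3)]
# Fusing the i- and j-loops leaves N_k * N_l * s2 * s3 elements:
assert abs(fused_size('ijkl', ranges, entries, {'i', 'j'})
           - 40 * 15 * 0.2 * 0.3) < 1e-9
```

Note that the entry (i, j, s1) drops out because both of its dimensions are fused, while (j, k, s2) and (k, l, s3) each touch an unfused dimension and so stay in the product.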
3.8 Fast Fourier Transform
The general form of an FFT formula is fr [K , j ] = z] xexp[z, j] where AT is a
set of indices, j ^ K and exp[z, j] = Section 2.8). This formula can be
represented by an FFT node as shown in Figure 3.21(a). We assume that the FFTs are
formed by calling library routines, which com pute the exponential functions needed.
Thus, we do not consider the memory usage by exponential functions. Furthermore,
since the FFT library routines transform one or more entire rows of X to an equal
number of rows of fr at a time, the i- and j-loop s at an FFT node cannot be fused
with its children or its parent; only the loops for the indices in K can be fused.
Figure 3.21(b) and (c) show the fusion graph and the loop fusion configuration when
the loops in K are fused.
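The fused loop structure of Figure 3.21(c) can be sketched with a library FFT routine (here NumPy's, as a stand-in; the sizes and the consumer are hypothetical):

```python
# Sketch of a fused K loop around an FFT node: the producer of X, the
# whole-row library FFT, and the consumer of fr all sit inside the K loop,
# so only one row of X and fr is live at a time. The i- and j-loops are
# internal to the library call and cannot be fused.
import numpy as np

N_K, N_i = 8, 16                      # N_j equals N_i for the FFT routine
rng = np.random.default_rng(0)
acc = np.zeros(N_i, dtype=complex)
for K in range(N_K):                  # fused K loop
    X_row = rng.standard_normal(N_i)  # produce X[K, 1:N_i]
    fr_row = np.fft.fft(X_row)        # whole-row library FFT
    acc += fr_row                     # consume fr[K, 1:N_j] immediately
```

Fusing the K loop reduces the storage for X and fr from N_K × N_i elements each to a single row of N_i elements each, which is exactly the benefit the fusion graph in (b) captures.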
3.9 Further Reduction in M emory Usage
So far, we have been restricting our attention to reducing memory usage without
any increase in the number of arithmetic operations. However, with this restriction,
the optimal loop fusion configuration for some formula sequences may still require
more than the available memory. The existence of common sub-expressions, sparse
arrays, and fast Fourier transforms in a formula sequence may prohibit the fusion of
some loops, thereby imposing high lower bounds on the sizes of some intermediate
arrays. As a result, it may be necessary to relax the restriction on operation count,
for the implementation of some formula sequences to be feasible in terms of memory
usage.
As we have explained in Section 1.3, optimizing for operation count, memory us-
age, and communication cost together in an integrated manner is intractable. For the
same reasons, we only consider the operation-optimal formula sequence when trading
operation count for memory usage, although another formula sequence may lead to
a better solution. In other words, when it is necessary to relax the restriction on
operation count to further reduce memory usage, our approach no longer guarantees
any optimality. Nevertheless, since we start with an operation-optimal formula se-
quence, we expect the operation-relaxed solutions obtained to have good performance
in terms of memory usage and operation count.
We propose the following transformations to the formula sequence in question and
the corresponding potential fusion graph for further memory reduction at a cost of
increased arithmetic operations. The resulting formula sequence and the potential
fusion graph are then fed into one of the memory usage minimization algorithms
described in Sections 3.3 and 3.5 to determine the optimal loop fusion configuration.
• Creating additional loops around some assignment statements may enable more
loop fusions and thus reduce array sizes. For example, in Figure 3.2(b), if
we create an i-loop around f2 and f3 and fuse it with the i-loop around A
and f1, we can eliminate the i-dimension of f1 and make it a scalar. This
increases the operation counts for f2 and f3 by a factor of N_i because they are
recomputed N_i times. In general, we add additional vertices to the potential fusion
graph and connect each new vertex of a node v and the corresponding vertex
at the parent of v with an additional potential fusion edge. If the additional
potential fusion edge is converted into a fusion edge in a fusion graph, the
operation count for node v is multiplied by Ni, where i is the loop index for the
fusion edge. Note that if we place no limit on the operation count, the memory
usage minimization problem would have a trivial solution, which is to put all
the assignment statements inside a perfect loop nest. Doing so would reduce
all intermediate arrays to scalars but the operation count would return to its
original unoptimized value.
• For a formula sequence with common sub-expressions, the fusion-preventing
constraints in Section 3.6 can be overcome by recomputing the common sub-
expressions, once for each use of a common sub-expression. We transform the
DAG and its potential fusion graph by splitting an n-parent node v (for n > 1)
into multiple nodes v_1, ..., v_n and making v_k the child of the k-th parent of v.
The subtrees or sub-DAGs rooted at v are also split into n copies. In a fusion
graph, the split nodes can be united into one if the fusions in their subtrees
or sub-DAGs are the same and if doing so does not violate the constraints in
Section 3.6. The operation count for node v or a descendant v' of v is multiplied
by the number of split nodes for v or v' remaining after the split nodes are united,
if possible.
• As mentioned in Section 3.8, the loops for the two indices in the exp function
of an FFT node v cannot be fused with the corresponding loops in the parent
or the children of v. We can convert the FFT node back into a multiplication
node and a summation node to allow more possible loop fusions. This may
result in an increase in the operation count, which can be calculated according
to Section 2.2.
The memory usage minimization algorithms can be modified to maintain the
additional arithmetic operations for each solution. A solution s is inferior to another
solution s' if, in addition to the existing criteria, the additional operations for s are greater
than or equal to those for s'. Any solution with a cumulative memory usage higher than
a user-specified limit can be pruned. At the end, the algorithms return an unpruned
solution for the root node that has the fewest additional arithmetic operations.
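The extended inferiority test and the memory-limit pruning can be sketched as follows (the solution representation is hypothetical; the sketch uses strict domination so that identical solutions do not prune each other):

```python
def inferior(s, t):
    """s is inferior to t if s is no better in memory usage and no better
    in additional operations (extending the existing pruning criteria)."""
    return s['memory'] >= t['memory'] and s['extra_ops'] >= t['extra_ops']

def prune(solutions, memory_limit):
    # Drop solutions over the user-specified memory limit, then keep only
    # the solutions that are not strictly dominated by a surviving one.
    kept = [s for s in solutions if s['memory'] <= memory_limit]
    return [s for s in kept
            if not any(inferior(s, t) and not inferior(t, s) for t in kept)]

sols = [{'memory': 100, 'extra_ops': 0},
        {'memory': 50, 'extra_ops': 10},
        {'memory': 120, 'extra_ops': 0},   # over the limit below
        {'memory': 60, 'extra_ops': 20}]   # dominated by (50, 10)
print(prune(sols, 110))
```

The surviving solutions form a Pareto frontier over memory usage and additional operations, matching the text's rule that a solution is discarded only when another is at least as good on every criterion.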
3.10 An Example
We illustrate the practical application of the memory usage minimization algo-
rithm on the example multi-dimensional summation described in Section 2.9. The
optimal formula sequence for the summation has a cost of 1.89 × 10^15 operations and
is reproduced below. The DAG representation of the formula sequence is shown in
Figure 3.22. In this example, array Y has a sparsity of 0.1 and the ranges of the
indices are as given in Section 2.9. Without any loop fusion, the total size of the
arrays is 1.13 × 10^13 elements.
f1[r,RL,RL1,t] = Y[r,RL] * G[RL1,RL,t] cost=1e+12 <r,RL,0.1>
[Figure 3.22 shows the DAG: input arrays Y[r,RL], G[RL1,RL,t], Y[r,RL2], and the exponential functions exp[k,r], exp[k,rl], exp[G,r], exp[G1,rl] at the leaves; multiplication and summation nodes f1, f2, f5, f6, f7, f10, and f11; and the FFT nodes f13 (over r) and f15 (over rl) at the top.]
Figure 3.22: The DAG representation of an example formula sequence.
f2[r,RL1,t] = sum RL f1[r,RL,RL1,t] cost=9.9e+11 dense
f5[r,RL2,rl,t] = Y[r,RL2] * f2[rl,RL2,t] cost=1e+14 <r,RL2,0.1>
f6[r,rl,t] = sum RL2 f5[r,RL2,rl,t] cost=9.9e+13 dense
f7[k,r,rl] = exp[k,r] * exp[k,rl] cost=0 dense
f10[r,rl,t] = f6[r,rl,t] * f6[rl,r,t] cost=1e+12 dense
f11[k,r,rl,t] = f7[k,r,rl] * f10[r,rl,t] cost=1e+13 dense
f13[k,rl,t,G] = fft r f11[k,r,rl,t] * exp[G,r] cost=1.660964e+15 dense
f15[k,t,G,G1] = fft rl f13[k,rl,t,G] * exp[G1,rl] cost=1.660964e+13 dense
Notice that the common sub-expressions Y, exp, and f6 appear at the right hand
side of more than one formula. Also, f13 and f15 are FFT formulae. As explained
in Sections 3.6, 3.7, and 3.8, if each array is to be computed only once, the presence
of these common sub-expressions and FFTs would prevent the fusion of some loops,
such as the r and rl loops between f6 and f10. Under the operation-count restriction,
the optimal loop fusion configuration obtained by the memory usage minimization
algorithm for static memory allocation requires memory storage for 1.10 × 10^10 array
elements, which is 1000 times better than without any loop fusion. But this translates
to about 110 gigabytes and probably still exceeds the amount of memory available in
any computer today. Thus, relaxation of the operation-count restriction is necessary
to further reduce the memory usage to reasonable values.
We perform the following simple transformations to the DAG and the correspond-
ing potential fusion graph (see Section 3.9).
• Two additional vertices are added: one for a k-loop around f10 and the other
for a t-loop around f7. These additional vertices are then connected to the
corresponding vertices in f11 with additional potential fusion edges to allow
more loop fusion opportunities between f11 and its two children.
• The common sub-expressions Y, exp, and f6 are split into multiple nodes. Two
copies of the sub-DAG rooted at f5 are also made. This overcomes the require-
ments on legal fusion graphs for DAGs discussed in Section 3.6.
The memory usage minimization algorithm for static memory allocation is then
applied on the transformed potential fusion graph. The loop fusion configuration,
the fusion graph, and the memory usage and operation count statistics of the optimal
solution found are shown in Figure 3.23. For clarity, the input arrays are not included
in the fusion graph. The memory usage of the optimal solution after relaxing the
operation-count restriction is significantly reduced, by a factor of about 100, to 1.12 ×
10^8 array elements. The operation count is increased by only around 10% to 2.10 ×
10^15. Compared with the best hand-optimized loop fusion configuration, which also
has some manually-applied transformations to reduce memory usage to 1.12 × 10^8
array elements and has 5.08 × 10^15 operations, the optimal loop fusion configuration
obtained by the algorithm shows a factor of 2.5 improvement in operation count while
using the same amount of memory.
[Figure 3.23 consists of (a) the loop fusion configuration: a loop nest over r, rl, and t that generates Y, accumulates f2, forms f5, f5', f6, f6', f10, f7, and f11 inside the fused loops, and calls the FFT routines to produce f13 and f15; (b) the fusion graph, with the input arrays omitted; and (c) a table of per-array memory usage and operation counts, totaling 1.12 × 10^8 array elements and 2.10 × 10^15 operations.]
Figure 3.23: Optimal loop fusions for the example formula sequence.
CHAPTER 4
COMMUNICATION MINIMIZATION
Given a sequence of formulae, we now address the problem of finding the optimal
partitioning of arrays and operations among the processors in order to minimize inter-
processor communication and computational costs in implementing the computation
on a message-passing parallel computer. Section 4.1 describes a multi-dimensional
processor model and characterizes the communication and computational costs. Sec-
tion 4.2 shows how the multi-dimensional processor model can be applied to analyze
the communication and computational costs of matrix multiplication on parallel com-
puters. Section 4.3 presents a dynamic programming algorithm for the communication
minimization problem. An example illustrating the use of the implemented algorithm
is provided in Section 4.4. The modifications to the algorithm for handling common
sub-expressions, sparse arrays, and FFT are discussed in Sections 4.5, 4.6 and 4.7,
respectively. Section 4.8 integrates the problems of communication minimization and
memory usage minimization and proposes two approaches for determining the distri-
bution of data arrays and computations among processors and the loop fusions that
minimize inter-processor communication for load-balanced parallel execution, while
not exceeding the memory limit.
[Figure: an expression tree over the input arrays A[i,j,t] and B[j,k,t], with summation nodes f1 and f2, the multiplication node f3 = f1 × f2, and a summation node at the root.]
Figure 4.1: An example expression tree.
4.1 Preliminaries
We use a logical view of the processors as a multi-dimensional grid, where each
array can be distributed or replicated along one or more of the processor dimensions.
Let p_d be the number of processors on the d-th dimension of an n-dimensional pro-
cessor array, so that the total number of processors is p_1 × p_2 × ... × p_n. We use
an n-tuple to denote the partitioning or distribution of the elements of a data array
on an n-dimensional processor array. The d-th position in an n-tuple α, denoted
α[d], corresponds to the d-th processor dimension. Each position may be one of the
following: an index variable distributed along that processor dimension, a '*' denot-
ing replication of data along that processor dimension, or a '1' denoting that only
the first processor along that processor dimension is assigned any data. If an index
variable appears as an array subscript but not in the n-tuple, then the corresponding
dimension of the array is not distributed. Conversely, if an index variable appears in
the n-tuple but not in the array, then the data is replicated along the corresponding
processor dimension, which is the same as replacing that index variable with a '*'.
As an example, consider the expression tree shown in Figure 2.1 and reproduced
in Figure 4.1. Suppose 64 processors form a 2 × 4 × 8 array. For the 3-dimensional
array B[j,k,t], the 3-tuple (k,*,1) specifies that the second dimension of B is dis-
tributed along the first processor dimension, the first and third dimensions of B
are not distributed, and that data are replicated along the second processor dimen-
sion and are assigned only to processors whose third processor dimension equals 1.
Thus, a processor whose id is P_{z1,z2,z3} will be assigned a portion of B specified by
B[1:N_j, myrange(z_1, N_k, p_1), 1:N_t] if z_3 = 1 and no part of B otherwise, where
myrange(z, N, p) is the range (z − 1) × N/p + 1 to z × N/p.
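The block ownership described above can be sketched directly (the helper name local_part_of_B is hypothetical, introduced only for this illustration):

```python
# Sketch of the myrange block distribution (1-based inclusive ranges).
def myrange(z, N, p):
    """Block of an N-element dimension owned by the z-th of p processors."""
    lo = (z - 1) * N // p + 1
    hi = z * N // p
    return lo, hi

# Portion of B[1:Nj, 1:Nk, 1:Nt] with distribution (k, *, 1) held by
# processor P_{z1,z2,z3}: all of j and t, a block of k, and data only
# on processors whose third grid coordinate is 1.
def local_part_of_B(z1, z2, z3, Nj, Nk, Nt, p1):
    if z3 != 1:
        return None                       # no part of B on this processor
    return ((1, Nj), myrange(z1, Nk, p1), (1, Nt))

print(local_part_of_B(2, 3, 1, 100, 40, 15, 2))  # ((1, 100), (21, 40), (1, 15))
```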
We assume the SPMD programming model and do not consider distributing the
computation of different formulae on different processors. Since a formula sequence
can be represented by an expression tree and each node in the expression tree corre-
sponds to an array, we use the terms 'node' and 'array' interchangeably and sometimes
refer to the array corresponding to a node v as array v.
A child array is redistributed before the evaluation of its parent if their distribu-
tions do not match. For instance, suppose the arrays f1[j,t] and f2[j,t] in Figure 4.1
have distributions (1,t,j) and (j,*,1) respectively. If we want f3 to have distribution
(j,t,1) when evaluating f3[j,t] = f1[j,t] × f2[j,t], f1 would have to be redistributed
from (1,t,j) to (j,t,1) because the two distributions do not match. But for f2 to go
from (j,*,1) to (j,t,1), each processor just needs to give up part of the t-dimension
of the array and no inter-processor data movement is required.
The number of processors or processor groups holding distinct parts of an array
v with distribution α is given by:

$$DistFactor(v, \alpha) = \prod_{d \,\mid\, \alpha[d] \,\in\, v.dimens} p_d$$

For example, if the array B[j,k,t] has distribution (*,k,1), then there are only p_2 = 4
processor groups having distinct parts of B.
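A minimal sketch of DistFactor, assuming a distribution tuple whose entries are index names, '*' (replicate), or '1' (first processor only):

```python
# Sketch: the number of processor groups holding distinct parts of array v
# under distribution alpha, on a grid with p[d] processors on dimension d.
def dist_factor(dimens, alpha, p):
    """dimens: set of v's array indices; alpha: distribution n-tuple whose
    entries are index names, '*', or '1'; p: processor grid shape."""
    factor = 1
    for d, entry in enumerate(alpha):
        if entry in dimens:        # only distributed dimensions contribute
            factor *= p[d]
    return factor

# B[j,k,t] on a 2 x 4 x 8 processor array, distribution (*, k, 1):
print(dist_factor({'j', 'k', 't'}, ('*', 'k', '1'), (2, 4, 8)))  # 4
```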
Let Mcost(localsize, α, β) be the communication cost in moving the elements of an
array, with localsize elements distributed on each processor, from an initial distribu-
tion α to a final distribution β. It depends on several factors such as the underlying
processor topology, the amount of node and link contention, and the message routing
mechanism employed. We empirically measure Mcost for each possible non-matching
pair of α and β and for several different localsizes on the target parallel computer.
Let MoveCost(v, α, β) denote the communication cost in redistributing the elements
of array v from an initial distribution α to a final distribution β. It can be expressed
as:

$$MoveCost(v, \alpha, \beta) = Mcost(DistSize(v, \alpha), \alpha, \beta)$$

where

$$DistSize(v, \alpha) = \frac{\prod_{i \in v.dimens} N_i}{DistFactor(v, \alpha)}$$

is the number of elements of array v distributed on each processor.
Let CalcCost(v, γ) be the computational cost in calculating an array v with γ as
the distribution of v. For multiplication, and for summation where the summation
index is not distributed, the computational cost for v can be quantified as the total
number of operations for v divided by the number of processors working on distinct
[Figure: the expression tree with leaves A[i,k] and B[k,j], the multiplication node f1, and the summation node f2 = Σ_k f1 at the root.]
Figure 4.2: Expression tree representation of matrix multiplication.
parts of v. In our example, if the array f3[j,t] has distribution (j,t,1), its com-
putational cost would be N_j × N_t/(p_1 × p_2) operations on each participating processor.
Formally,

$$CalcCost(v, \gamma) = \frac{\prod_{i \in v.indices} N_i}{\prod_{d \,\mid\, \gamma[d] \,\in\, v.indices} p_d}$$

where v.indices = v.dimens ∪ {v.sumindex} is the set of loop indices around the com-
putation of v.
For the case of summation where the summation index i = v.sumindex is dis-
tributed, partial sums of v are first formed on each processor and then either consol-
idated on one processor along the i dimension or replicated on all processors along
the same processor dimension. We denote by CalcCost1 and MoveCost1 the compu-
tational and communication costs for forming the sum without replication, and by
CalcCost2 and MoveCost2 those with replication.
4.2 Application to Matrix Multiplication
Matrix multiplication is a simple application in which the above multi-dimensional
processor model can be applied to analyze the communication and computational cost
on parallel computers. The standard form $C[i,j] = \sum_k (A[i,k] \times B[k,j])$ of matrix
multiplication can be rewritten as the following formula sequence:

f1[i,j,k] = A[i,k] × B[k,j]
C[i,j] = f2[i,j] = Σ_k f1[i,j,k]

and represented as the expression tree shown in Figure 4.2.
Several existing matrix multiplication algorithms for parallel computers fit well
into this model. One of them [31] is a simple parallel implementation of the serial block
matrix multiplication algorithm. In this simple parallel algorithm, the processors form
a 2-dimensional array. Initially, arrays A and B are fully block-distributed along both
processor dimensions. The initial distribution can be defined by the 2-tuples (i,k) for
array A and (k,j) for array B. In order for each processor P_{i,j} to acquire all the data
it needs to compute the sub-block C_{i,j} of the result matrix, sub-blocks of A and B
are then broadcast along the second and the first processor dimensions respectively.
The 2-tuple for the data distribution after broadcast is (i,j), which is equivalent
to (i,*) for A and (*,j) for B. The costs of broadcasting A and B are denoted as
MoveCost(A, (i,k), (i,j)) and MoveCost(B, (k,j), (i,j)) respectively.
As the distribution is now identical for both A and B, multiplication can take
place. The distribution of the multiplication operation is again (i,j) (note that the
k dimension is not distributed) and the cost of this operation is CalcCost(f1, (i,j)).
The intermediate result f1[i,j,k] now resides on P_{i,j} and hence has the same distri-
bution tuple (i,j). The final step is to add up the products on each processor. The
distributions of the summation operation and the result array C are both (i,j). The
cost of the summation is CalcCost1(f2, (i,j)). No further data movement is required.
Hence, the total execution time of this algorithm is represented as:

MoveCost(A, (i,k), (i,j)) + MoveCost(B, (k,j), (i,j)) + CalcCost(f1, (i,j)) + CalcCost1(f2, (i,j))
Another matrix multiplication algorithm that fits into our model is known as the
DNS algorithm [31], which is based on a 3-dimensional processor view. The source
arrays A and B are initially distributed the same way on the bottom processor plane
where the third processor dimension is 1. Thus, their initial distributions can be
specified by the 3-tuples (i,k,1) for array A and (k,j,1) for array B. Then, the
elements of these two arrays are broadcast to all other processors in such a way that
processor P_{i,j,k} will have A[i,k] and B[k,j]. This intermediate data distribution is
described by (i,j,k), which is the same as (i,*,k) for array A and (*,j,k) for array
B. The costs of redistributing A and B are denoted as MoveCost(A, (i,k,1), (i,j,k))
and MoveCost(B, (k,j,1), (i,j,k)) respectively.
Now that both arrays have the same distribution, the multiplication step can be
carried out under the same loop distribution (i,j,k). The intermediate result f1[i,j,k]
represents the product now stored in P_{i,j,k} and hence has the same data distribution
(i,j,k). The computational cost in forming f1 is denoted as CalcCost(f1, (i,j,k)).
Finally, the sums C[i,j] for all i and j are formed by single-node accumulation
of the products along the third processor dimension, and C[i,j] ends up in P_{i,j,1}.
The loop distribution for the summation step is again (i,j,k), but the data dis-
tribution of C is (i,j,1). In this step, the computational and communication costs
are CalcCost1(f2, (i,j,k)) and MoveCost2(f2, (i,j,k), (i,j,1)) respectively. Therefore,
the total execution time can be expressed as:

MoveCost(A, (i,k,1), (i,j,k)) + MoveCost(B, (k,j,1), (i,j,k)) + CalcCost(f1, (i,j,k)) + CalcCost1(f2, (i,j,k)) + MoveCost2(f2, (i,j,k), (i,j,1))
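The two total-cost expressions can be assembled structurally, with MoveCost and CalcCost left abstract (in the dissertation they come from empirical measurements; the Python function names below are stand-ins for illustration only):

```python
# Structural sketch of the total execution-time expressions for the simple
# 2-D algorithm and the DNS 3-D algorithm. The cost functions are passed
# in, mirroring how the model plugs in measured Mcost/CalcCost values.
def simple_2d_cost(move_cost, calc_cost, calc_cost1):
    return (move_cost('A', ('i', 'k'), ('i', 'j')) +
            move_cost('B', ('k', 'j'), ('i', 'j')) +
            calc_cost('f1', ('i', 'j')) +
            calc_cost1('f2', ('i', 'j')))

def dns_3d_cost(move_cost, calc_cost, calc_cost1, move_cost2):
    return (move_cost('A', ('i', 'k', '1'), ('i', 'j', 'k')) +
            move_cost('B', ('k', 'j', '1'), ('i', 'j', 'k')) +
            calc_cost('f1', ('i', 'j', 'k')) +
            calc_cost1('f2', ('i', 'j', 'k')) +
            move_cost2('f2', ('i', 'j', 'k'), ('i', 'j', '1')))

# With every cost term set to 1, the expressions simply count their terms.
unit = lambda *args: 1.0
print(simple_2d_cost(unit, unit, unit))        # 4.0 (four cost terms)
print(dns_3d_cost(unit, unit, unit, unit))     # 5.0 (five cost terms)
```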
4.3 A Dynamic Programming Algorithm
We assume the input arrays can be distributed initially among the processors
in any way at zero cost, as long as they are not replicated. We do not require the
final results to be distributed in any particular way. Our approach works regardless of
whether any initial or final data distribution is given. If all data arrays and loop nests
have the same distribution n-tuple (for an n-dimensional processor array), no data
movement among the processors will be required during execution. This is achievable
if and only if there exist n indices that appear in the index sets of every data array
and every loop nest. When this condition cannot be satisfied, we need to determine
the combination of the distribution n-tuples for the data arrays and the loop nests
that minimizes the computational and communication costs.
A dynamic programming algorithm that determines the distribution of dense ar-
rays to minimize the computational and communication costs is given below.

1. Transform the given sequence of formulae into an expression tree T (see Sec-
tion 2.2).

2. Let Cost(v, α) be the minimal total cost for the subtree rooted at v with distri-
bution α. Initialize Cost(v, α) for each leaf node v in T and each distribution
α as follows:

$$Cost(v, \alpha) = \begin{cases} 0 & \text{if } NoReplicate(\alpha) \\ \min_{NoReplicate(\beta)} \{ MoveCost(v, \beta, \alpha) \} & \text{otherwise} \end{cases}$$

where NoReplicate(α) is a predicate meaning α involves no replication.
3. Perform a bottom-up traversal of T. For each internal node u and each distri-
bution α, calculate Cost(u, α) as follows:
Case (a): u is a multiplication node with two children v and v'. We need both
v and v' to have the same distribution, say γ, before u can be formed. After
the multiplication, the product could be redistributed if necessary. Thus,

$$Cost(u, \alpha) = \min_\gamma \{ Cost(v, \gamma) + Cost(v', \gamma) + CalcCost(u, \gamma) + MoveCost(u, \gamma, \alpha) \}$$
Case (b): u is a summation node over index i and with a child v. v may have
any distribution γ. If i ∈ γ, each processor first forms partial sums of u and
then we either combine the partial sums on one processor along the i dimension
or replicate them on all processors along that processor dimension. Afterwards,
the sum could be redistributed if necessary. Thus,

$$Cost(u, \alpha) = \min_\gamma \{ Cost(v, \gamma) + \min( CalcCost1(u, \gamma) + MoveCost1(u, \gamma, \alpha),\; CalcCost2(u, \gamma) + MoveCost2(u, \gamma, \alpha) ) \}$$
In either case, save into Dist(u, α) the distribution γ that minimizes Cost(u, α).

4. When step 3 finishes for all nodes and all indices, the minimal total cost for
the entire tree is min_α { Cost(T.root, α) }. The distribution α that minimizes the
total cost is the optimal distribution for T.root. The optimal distributions for
other nodes can be obtained by tracing back Dist(u, α) in a top-down manner,
starting from Dist(T.root, α).
The running time complexity of this algorithm is O(q^2 |T|), where |T| is the number
of internal nodes in the expression tree, q = O(m^n) is the number of different
possible distribution n-tuples, and m is the number of index variables. The storage
requirement for Cost(u, α) and Dist(u, α) is O(q |T|).
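A condensed sketch of the algorithm for multiplication nodes (the summation case is analogous, using CalcCost1/2 and MoveCost1/2; the Node class, distribution set, and cost functions below are simplified stand-ins, not the dissertation's implementation):

```python
# Bottom-up dynamic programming over an expression tree: for each node and
# each distribution alpha, minimize over the shared child distribution gamma.
import math

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def optimal_cost(root, dists, calc_cost, move_cost, no_replicate):
    cost, dist = {}, {}            # cost[(v, a)] and Dist[(v, a)] as in the text

    def solve(v):
        if not v.children:         # leaf: free initial non-replicated layout
            for a in dists:
                cost[v, a] = (0.0 if no_replicate(a) else
                              min(move_cost(v, b, a)
                                  for b in dists if no_replicate(b)))
            return
        for c in v.children:
            solve(c)
        for a in dists:
            cost[v, a], dist[v, a] = math.inf, None
            for g in dists:        # children share distribution g before v forms
                c = (sum(cost[ch, g] for ch in v.children) +
                     calc_cost(v, g) + move_cost(v, g, a))
                if c < cost[v, a]:
                    cost[v, a], dist[v, a] = c, g

    solve(root)
    return min(cost[root, a] for a in dists)

# Toy instance: one multiplication node over two leaves, two distributions.
tree = Node('f1', [Node('A'), Node('B')])
best = optimal_cost(tree, ['(i,j)', '(k,j)'],
                    calc_cost=lambda v, g: 1.0,
                    move_cost=lambda v, g, a: 0.0 if g == a else 1.0,
                    no_replicate=lambda a: True)
print(best)  # 1.0
```

The q^2 factor in the running time is visible as the nested loops over a and g at each internal node.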
4.4 An Example
The above algorithm for dense arrays has been implemented. As an illustration,
we apply it to a triple matrix multiplication problem specified by the following input
file to the program:
f1[i,j,k] = A[i,j] * B[j,k]
f2[i,k] = sum j f1[i,j,k]
f3[k,l,m] = C[k,l] * D[l,m]
f4[k,m] = sum l f3[k,l,m]
f5[i,k,m] = f2[i,k] * f4[k,m]
f6[i,m] = sum k f5[i,k,m]
i 200
j 96000
k 400
l 84000
m 200
end
The first six lines specify the sequence of formulae. The next five lines provide
the ranges of the index variables. Note that the matrices are rectangular. The target
parallel machine is a Cray T3E at the Ohio Supercomputer Center. We empirically
measured the processor speed for the computation kernel (found to be about 400
Mflops) and Mcost (which is used to calculate MoveCost) for each possible pair of initial
and final distributions for several different message sizes. These measurements are
given as auxiliary input to the program. Eight processors viewed as a logical two-
dimensional 2 × 4 array are specified.
The optimal distribution of the arrays that minimizes the total computational
and communication time as generated by the program is shown in Table 4.1. The
appearance of two n-tuples under the distribution column indicates redistribution of
the array. CalcCost and MoveCost are expressed in seconds. For f2 and f4, the partial
sums are not replicated.
Array        Size          Distribution (γ → α)   CalcCost(u, γ)   MoveCost(u, γ, α)
A[i,j]       1.92 × 10^7   (i,j) → (*,j)          0.000            0.793
B[j,k]       3.84 × 10^7   (k,j)                  0.000            0.000
C[k,l]       3.36 × 10^7   (k,l)                  0.000            0.000
D[l,m]       1.68 × 10^7   (m,l) → (*,l)          0.000            0.694
f1[i,j,k]    7.68 × 10^9   (k,j)                  2.400            0.000
f2[i,k]      8.00 × 10^4   (k,j) → (i,*)          2.400            0.024
f3[k,l,m]    6.72 × 10^9   (k,l)                  2.100            0.000
f4[k,m]      8.00 × 10^4   (k,l) → (*,m)          2.100            0.024
f5[i,k,m]    1.60 × 10^7   (i,m)                  0.005            0.000
f6[i,m]      4.00 × 10^4   (i,m)                  0.005            0.000
Total time                                        9.010            1.535
Table 4.1: Optimal array distributions for an example formula sequence.
4.5 Common Sub-Expressions
With the introduction of common sub-expressions, a formula sequence must be
represented as a directed acyclic graph (DAG) instead of an expression tree, since each
common sub-expression appears as an operand in more than one subsequent formula
and has multiple parents. If the algorithm for a tree (in Section 4.3) is applied on
a DAG, problems such as cost double-counting, distribution mismatch, and wrong
solution may arise due to the multi-parent nodes. We want to be able to compute
each intermediate array only once and use it multiple times as needed (possibly with
multiple redistributions).
Let v be a multi-parent node. We propose the following changes to the algorithm.
• To each parent of v, we add an extra MoveCost term for the redistribution of
v.
• The minimization over the distribution of v is not performed at its parents,
but rather at the dominator node of v, denoted DOM(v) (which is the closest
ancestor of v that every path from v to the root must pass through).
• For each node v' on any path from v to DOM(v), we keep a separate Cost(v', α)
for each possible distribution of v.
To illustrate the changes, consider the following formula sequence in which f1 has
2 parents.

f1[i,j,k] = A[i,j] × B[j,k]
f2[i,j,k] = f1[i,j,k] × C[i,k]
f3[i,j,k] = f1[j,k,i] × f2[i,j,k]

The equations for finding the lowest cost are:

Cost(f1, α) = min_γ { Cost(A, γ) + Cost(B, γ) + CalcCost(f1, γ) + MoveCost(f1, γ, α) }

Cost(f2, θ)|_(f1,α) = min_β { Cost(f1, α) + MoveCost(f1, α, β) + Cost(C, β) + CalcCost(f2, β) + MoveCost(f2, β, θ) }

Cost(f3, φ) = min_θ { min_α { Cost(f1, α) + MoveCost(f1, α, θ) + Cost(f2, θ)|_(f1,α) } + CalcCost(f3, θ) + MoveCost(f3, θ, φ) }
Note that the complexity of the revised algorithm is now O(q^{2+t} |T|), where |T| is
the number of nodes in the DAG, q is the number of different possible distributions,
and t is the number of multi-parent nodes that are 'open' at a time.
A DAG also leads to another complication called the Steiner tree effect [66], i.e.,
the distributions of the parent nodes of a multi-parent node v may be obtained with
a lower cost by including more 'transit' nodes between v and its parents. The revised
algorithm has taken care of this effect for 2-parent nodes. To obtain optimal solutions
for nodes with more than 2 parents, more Steiner trees with more MoveCosts have to
be considered.
4.6 Sparse Arrays
A sparse array is said to be evenly distributed among the processors if an equal
number of array elements is assigned to each processor. We do not consider un-
even distribution of sparse arrays as it would lead to load imbalance and probably
sub-optimal performance. With the uniform sparsity assumption, a sparse array is
guaranteed to be evenly distributed if no two distributed array dimensions appear in
the same sparsity entry or are reachable from each other in the sparsity graph (see
Section 2.7). As an example, if the array v[i,j,k,l] has sparsity entries (i,j,0.1) and
(j,k,0.1), then at most one of the 3 indices i, j and k may be distributed; otherwise,
uneven distribution would result.
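Under the uniform sparsity assumption, this test amounts to checking that no two distributed dimensions lie in the same connected component of the sparsity graph. A sketch using union-find (the data structures here are hypothetical, for illustration):

```python
# Sketch: a distribution is guaranteed even if no two distributed
# dimensions are reachable from each other in the sparsity graph.
def evenly_distributable(distributed_dims, sparsity_edges):
    # Union-find over the sparsity graph's connected components.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in sparsity_edges:
        parent[find(a)] = find(b)
    roots = [find(d) for d in distributed_dims]
    return len(roots) == len(set(roots))   # all in distinct components?

# v[i,j,k,l] with sparsity entries (i,j,0.1) and (j,k,0.1): i, j, and k are
# mutually reachable, so at most one of them may be distributed.
print(evenly_distributable({'i'}, [('i', 'j'), ('j', 'k')]))       # True
print(evenly_distributable({'i', 'k'}, [('i', 'j'), ('j', 'k')]))  # False
```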
Since zero elements in sparse arrays do not participate in computation or data
movement, the array size component of the computational or communication costs
for sparse arrays equals the number of non-zero elements. In other words,

CalcCost(v, α) = CalcCost(v', α)
MoveCost(v, α, β) = MoveCost(v', α, β)

where v' is a dense array which has the same number of non-zero elements as v. These
formulae for CalcCost and MoveCost of sparse arrays are exact unless the indices
assigned to a processor before and after redistribution of v are mutually reachable in
the sparsity graph, in which case MoveCost would be an approximation.
4.7 Fast Fourier Transform
Since exponential functions are computed on the fly (by FFT routines), they are
neither stored as arrays nor moved between processors and the costs of forming them
are usually absorbed into the FFT costs. Thus, we can simplify a DAG by replacing
a multiplication node whose children are exponential functions by an exponential
function leaf node.
An FFT formula introduces into a DAG a new kind of node called an FFT node,
which has the summation index as its label and the operand array and an exponential
function as its two children. Let u be an FFT node with summation index i and
operand arrays v[K,i] and exp[i,j], and let γ be the distribution of v. The minimal
total cost for the DAG rooted at u with distribution α is evaluated as follows. If
i ∉ γ, each processor independently performs serial FFTs on its local portion of v;
otherwise, group(s) of processors perform parallel FFTs collectively on their portions
of v. Afterwards, u may be redistributed if necessary. Hence,
$$Cost(u, \alpha) = \begin{cases} \min_\gamma \{ Cost(v, \gamma) + CalcCost3(u, \gamma) + MoveCost(u, \gamma, \alpha) \} & \text{if } i \notin \gamma \\ \min_\gamma \{ Cost(v, \gamma) + CalcCost4(u, \gamma) + MoveCost4(u, \gamma) + MoveCost(u, \gamma', \alpha) \} & \text{otherwise} \end{cases}$$
where CalcCost3 is the computational cost for forming the serial FFTs, CalcCost4
and MoveCost4 are the computational and communication costs for forming the par-
allel FFTs, and γ' is γ with i replaced by j.
4.8 Communication Minimization with Memory Constraint
Given a sequence of formulae, we now address the problem of finding the optimal
partitioning of arrays and operations among the processors and the loop fusions on
each processor in order to minimize inter-processor communication and computational
costs while staying within the available memory in implementing the computation on
a message-passing parallel computer. Section 4.8.1 discusses the combined effects of
loop fusions and array/operation partitioning on communication cost, computational
cost, and memory usage. Two approaches for solving this problem are presented in
Section 4.8.2.
4.8.1 Preliminaries
The partitioning of data arrays among the processors and the allowable fusions
of loops on each processor are inter-related. For the fusion of a t-loop between nodes
u and v to be possible, that loop must either be undistributed at both u and v, or
be distributed onto the same number of processors at u and at v (but not necessarily
along the same processor dimension). Otherwise, the range of the t-loop at node
u would be different from that at node v. As an example, suppose 128 processors
form a 2 × 2 × 4 × 8 array. Consider the expression tree shown in Figure 3.1(c) and
reproduced in Figure 4.3. If array B[j,k,l] has distribution (j,k,*,1) and fusion (jl)
with f2[j,k,l], then f2 can have distribution (1,j,*,k) but not (k,l,j,1) because the
fusion (jl) forces the j-dimension of f2 to be distributed onto 2 processors and the
l-dimension to be undistributed. Array partitioning and loop fusion also have effects
on memory usage, communication cost, and computational cost.
Fusing an l-loop between a node v and its parent eliminates the l-dimension of
array v. If the l-loop is not fused but the l-dimension of array v is distributed along
the d-th processor dimension, then the range of the l-dimension of array v on each
processor is reduced to N_l/p_d. Let DistSize(v, α, f) be the size on each processor of
Figure 4.3: The expression tree in Figure 3.1(c).
array v, which has fusion f with its parent and distribution α. We have

DistSize(v, α, f) = ∏_{i ∈ v.dimens} DistRange(i, v, α, Set(f))

where v.dimens = v.indices − {v.sumindex} is the set of array dimension indices of v before
loop fusions, and

DistRange(i, v, α, x) = { 1          if i ∈ x
                        { N_i / p_d  if i ∉ x and i = α[d]
                        { N_i        if i ∉ x and i ∉ α
In our example, assume again that N_i = 500, N_j = 100, N_k = 40, and N_l = 15. If the
array B[j, k, l] has distribution (j, k, *, *) and fusion (jl) with f2, then the size of B on
each processor would be N_k/2 = 20 since the k-dimension is the only unfused dimension
and is distributed onto 2 processors. Note that if array v undergoes redistribution
from α to β, the array size on each processor after redistribution is DistSize(v, β, f),
which could be different from DistSize(v, α, f), the size before redistribution.
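The DistSize and DistRange definitions above can be sketched in code. This is an illustrative rendering, not the dissertation's implementation: the representation of distributions as tuples, fusions as sets, and N and p as a dictionary and list are all assumptions made here for concreteness.

```python
from math import prod

# Illustrative sketch (not the dissertation's code) of DistSize/DistRange.
# A distribution alpha is a tuple whose d-th entry is the index distributed
# along the d-th processor dimension ('*' = none); `fused` is Set(f),
# N maps each index to its range, and p[d] is the processor-array extent.

def dist_range(i, alpha, fused, N, p):
    """Per-processor extent contributed by dimension i of the array."""
    if i in fused:                       # loop fused with parent: dimension gone
        return 1
    if i in alpha:                       # distributed along dimension d
        return N[i] // p[alpha.index(i)]
    return N[i]                          # neither fused nor distributed

def dist_size(dimens, alpha, fused, N, p):
    """DistSize(v, alpha, f) = product of DistRange over v.dimens."""
    return prod(dist_range(i, alpha, fused, N, p) for i in dimens)

# The text's example: B[j,k,l] on a 2x2x4x8 processor array,
# distribution (j,k,*,*), fusion (jl) with its parent.
N = {'j': 100, 'k': 40, 'l': 15}
p = [2, 2, 4, 8]
print(dist_size(['j', 'k', 'l'], ('j', 'k', '*', '*'), {'j', 'l'}, N, p))  # -> 20
```

The fused j and l dimensions contribute a factor of 1 each, leaving only the k-dimension's 40/2 = 20 elements per processor, matching the text.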
The initial and final distributions of an array v determine the communication
pattern and whether v needs redistribution, while loop fusions change the number
of times array v is redistributed and the size of each message. Let v be an array
that needs to be redistributed. If node v is not fused with its parent, array v is
redistributed only once. Fusing a t-loop between node v and its parent puts the
collective communication code for redistribution inside the t-loop. Thus, the number
of redistributions is increased by a factor of N_t/p_d if the t-dimension of v is distributed
along the d-th processor dimension and by a factor of N_t if the t-dimension of v is
not distributed. In other words, loop fusions cannot reduce communication cost.
Continuing with our example, if the array B[j, k, l] has fusion (jl) with f2 and needs
to be redistributed from (j, k, *, *) to (k, j, *, *), then it would be redistributed N_j/2 ×
N_l = 750 times.
Let Mcost(localsize, α, β) be the communication cost of moving the elements of
an array, with localsize elements distributed on each processor, from an initial distribution
α to a final distribution β. We empirically measure Mcost for each possible
non-matching pair of α and β and for several different localsizes on the target parallel
computer. Let MoveCost(v, α, β, f) denote the communication cost of redistributing
the elements of array v, which has fusion f with its parent, from an initial distribution
α to a final distribution β. It can be expressed as:

MoveCost(v, α, β, f) = MsgFactor(v, α, Set(f)) × Mcost(DistSize(v, α, Set(f)), α, β)

where

MsgFactor(v, α, x) = ∏_{i ∈ v.dimens} LoopRange(i, v, α, x)

LoopRange(i, v, α, x) = { 1          if i ∉ x
                        { N_i / p_d  if i ∈ x and i = α[d]
                        { N_i        if i ∈ x and i ∉ α
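The MoveCost decomposition above admits a similar sketch. The data-structure choices are assumptions, and Mcost is passed in as a stub, since the dissertation obtains it by empirical measurement on the target machine rather than by formula.

```python
from math import prod

# Companion sketch for MoveCost; representation choices are assumptions,
# and mcost is a caller-supplied stub standing in for the empirically
# measured Mcost(localsize, alpha, beta).

def loop_range(i, alpha, fused, N, p):
    """Repetition factor contributed by dimension i (the dual of DistRange:
    only FUSED loops multiply the number of redistributions)."""
    if i not in fused:
        return 1                          # not fused: redistributed once
    if i in alpha:                        # fused and distributed along dim d
        return N[i] // p[alpha.index(i)]
    return N[i]                           # fused and undistributed

def msg_factor(dimens, alpha, fused, N, p):
    """MsgFactor(v, alpha, x) = product of LoopRange over v.dimens."""
    return prod(loop_range(i, alpha, fused, N, p) for i in dimens)

def move_cost(dimens, alpha, beta, fused, N, p, mcost, local_size):
    """MoveCost(v, alpha, beta, f) = MsgFactor * Mcost(DistSize, alpha, beta)."""
    return msg_factor(dimens, alpha, fused, N, p) * mcost(local_size, alpha, beta)

# The text's example: B[j,k,l] with fusion (jl), initial distribution
# (j,k,*,*) on a 2x2x4x8 processor array -> (100/2) * 15 = 750 messages.
N = {'j': 100, 'k': 40, 'l': 15}
p = [2, 2, 4, 8]
print(msg_factor(['j', 'k', 'l'], ('j', 'k', '*', '*'), {'j', 'l'}, N, p))  # -> 750
```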
Note that the computational cost CalcCost is unaffected by loop fusions (see Section 4.1).
4.8.2 Two Approaches
In Section 4.3, we have solved the communication minimization problem without
considering loop fusion or memory usage. In practice, the arrays involved are
often too large to fit into the available memory even after partitioning among the
processors. We now present two approaches to extending the framework developed
in Sections 3.1 to 3.5 to the parallel computing context to solve the communication
minimization problem with memory constraint. We assume the input arrays can be
distributed initially among the processors in any way at zero cost, as long as they are
not replicated.
Our first approach is to find an optimal loop fusion configuration from among
all the communication-optimal distribution configurations so that the memory usage
is below the limit but without increasing the communication cost. It decouples the
communication minimization problem and the memory usage minimization problem.
This approach has two phases. The first phase is to determine all distribution
configurations that minimize the computational and communication costs. A dynamic
programming algorithm for this purpose is given below. It assumes that no loops are
fused. The algorithm is basically the same as the one in Section 4.3 except that we
now keep the set of all optimal solutions instead of one optimal solution for each node
in the expression tree.
1. Transform the given sequence of formulae into an expression tree T (see Section 2.3).
2. Let Cost(v, α) be the minimal total cost for the subtree rooted at v with distribution
α. Initialize Cost(v, α) for each leaf node v in T and each distribution
α as follows:

Cost(v, α) = { 0                                                if NoReplicate(α)
             { min_{β : NoReplicate(β)} MoveCost(v, β, α, ∅)    otherwise

where NoReplicate(α) is a predicate meaning that α involves no replication.
3. Perform a bottom-up traversal of T. For each internal node u and each distribution
α, calculate Cost(u, α) as follows:

Case (a): u is a multiplication node with two children v and v'. We need both
v and v' to have the same distribution, say γ, before u can be formed. After
the multiplication, the product could be redistributed if necessary. Thus,

Cost(u, α) = min_γ { Cost(v, γ) + Cost(v', γ) + CalcCost(u, γ) + MoveCost(u, γ, α, ∅) }

Case (b): u is a summation node over index i and with a child v, which may
have any distribution γ. If i ∈ γ, each processor first forms partial sums of
u, and then we either combine the partial sums on one processor along the i
dimension or replicate them on all processors along that processor dimension.
Afterwards, the sum could be redistributed if necessary. Thus,

Cost(u, α) = min_γ { Cost(v, γ) + min( CalcCost1(u, γ) + MoveCost1(u, γ, α, ∅), CalcCost2(u, γ) + MoveCost2(u, γ, α, ∅) ) }

In either case, save into DistSet(u, α) the set of distributions γ that minimize
Cost(u, α).
4. When step 3 finishes for all nodes and all indices, the minimal total cost for the
entire expression tree T is min_α { Cost(T.root, α) }. The distributions α that
minimize the total cost for T are the optimal distributions for T.root. The optimal
distributions for other nodes can be obtained by tracing back DistSet(u, α) in
a top-down manner, starting from DistSet(T.root, α) for each α.
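The four steps above can be condensed into a schematic dynamic program. This toy version is an assumption-laden sketch: the Node class, the cost lambdas, the two test distributions, and the zero-cost leaf placement all stand in for the chapter's actual machinery, and only the multiplication case (a) is shown.

```python
# A toy, self-contained rendering of the phase-1 dynamic program; every
# name here is a hypothetical stand-in for the chapter's structures.

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def postorder(u):
    for c in u.children:
        yield from postorder(c)
    yield u

def phase1(root, dists, move, calc):
    """cost[u][a]: minimal cost of the subtree at u, ending in distribution a.
    dist_set[u][a]: ALL child distributions gamma achieving that minimum."""
    cost, dist_set = {}, {}
    for u in postorder(root):               # bottom-up traversal
        cost[u], dist_set[u] = {}, {}
        for a in dists:
            if not u.children:
                cost[u][a] = 0              # inputs placed freely at zero cost
                continue
            # children must share a common distribution gamma (case (a))
            by_gamma = {g: sum(cost[c][g] for c in u.children)
                           + calc(u, g) + move(u, g, a)
                        for g in dists}
            best = min(by_gamma.values())
            cost[u][a] = best
            dist_set[u][a] = {g for g, v in by_gamma.items() if v == best}
    return cost, dist_set

# Two leaves multiplied at the root; redistribution costs 1, computing costs 1.
root = Node('mul', [Node('A'), Node('B')])
cost, dist_set = phase1(root, ['d0', 'd1'],
                        move=lambda u, g, a: 0 if g == a else 1,
                        calc=lambda u, g: 1)
print(min(cost[root].values()))  # -> 1
```

Keeping the whole minimizing set in dist_set, rather than one witness, is exactly what lets the second phase enumerate every communication-optimal configuration by a top-down traceback.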
The running time complexity of this algorithm is O(q²|T|), where |T| is the number
of nodes in the expression tree, q = O(m^n) is the number of different possible distribution
n-tuples, and m is the number of index variables. The storage requirement
for Cost(u, α) is O(q|T|), and for DistSet(u, α) is O(q²|T|).
The outcome of the first phase is a set of array distribution configurations that
have the same minimal total communication and computational cost. These optimal
distribution configurations are then fed into the second phase to find a loop
fusion configuration that uses no more than the available memory on each processor
without increasing the communication cost. One of the memory-optimal loop fusion
algorithms in Sections 3.3 and 3.5 (depending on the memory allocation model) can
be used in the second phase, but with the following changes:
1. To ensure the communication cost stays minimal, if an array v is redistributed,
then v must not be fused with its parent. Arrays that are not redistributed can
be fused freely with their parents since they do not contribute to any communication
cost. The InitFusible procedure can be easily modified for this.
2. In calculating the array sizes on each processor, the DistSize function in Section
4.8.1 is used in place of the FusedSize function so that the effect of distribution
on array sizes is accounted for.
3. If the size of an array before and after redistribution is different, the higher of
the two should be used in determining memory usage.
Since the goal is to stay within the available memory rather than to minimize
memory usage, we do not need to apply the algorithm on all the optimal distribution
configurations. We can stop as soon as we find that the memory usage of an optimal
loop fusion for a distribution configuration is below the limit. In case the number
of optimal distribution configurations is large, the following pruning methods can be
applied.
1. Any distribution configuration with a redistributed array whose size is larger
than the available memory can be pruned.
2. For static memory allocation, any distribution configuration in which the total
size of all redistributed arrays is above the memory limit can be pruned.
3. If several distribution configurations have the same set of redistributed arrays,
all but one of them can be pruned. This is because they have the same set of
fusible loops and hence the same minimal memory usage.
4. Instead of processing one distribution configuration at a time, apply the algorithm
on the optimal distributions for each array v and prune the distributions
for the children of v that have inferior memory usage.
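Pruning rules 1-3 above can be sketched as a single filter pass. The representation here is a hypothetical simplification: a configuration is reduced to a map from each redistributed array to its per-processor size, which is all that these three rules inspect.

```python
# Hypothetical sketch of pruning rules 1-3; a configuration is modeled
# as {redistributed array name: per-processor size}.

def prune_configs(configs, mem_limit, static_alloc=False):
    kept, seen_sets = [], set()
    for redist in configs:
        # Rule 1: a single redistributed array exceeding available memory.
        if any(size > mem_limit for size in redist.values()):
            continue
        # Rule 2 (static allocation): total size of redistributed arrays.
        if static_alloc and sum(redist.values()) > mem_limit:
            continue
        # Rule 3: same set of redistributed arrays => same fusible loops,
        # hence the same minimal memory usage; keep one representative.
        key = frozenset(redist)
        if key in seen_sets:
            continue
        seen_sets.add(key)
        kept.append(redist)
    return kept

configs = [{'T1': 50, 'T2': 30},   # kept
           {'T1': 200},            # pruned by rule 1
           {'T1': 50, 'T2': 40},   # pruned by rule 3 (same {T1, T2} set)
           {'T1': 60, 'T2': 60}]   # pruned by rule 2 under static allocation
print(prune_configs(configs, mem_limit=100, static_alloc=True))
```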
The advantage of the first approach is its efficiency as it only searches among
the optimal distribution configurations. If it finds a memory-optimal loop fusion for
an optimal distribution configuration that uses no more memory than available, the
solution must be optimal because no other solution could possibly use less memory
than an optimal distribution configuration. But if none of the optimal distribution
configurations can be fused to stay within the memory limit, the first approach finds
no solution.
The second approach is to search among all combinations of loop fusions and array
distributions to find one that has minimal total communication and computational
cost and uses no more than the available memory. It is also a bottom-up, dynamic
programming algorithm. At each node v in the expression tree T, we consider all legal
combinations of array distributions for v and loop fusions between v and its parent. A
combination is illegal if there exists a t-loop fused between v and its parent and the
t-dimension of the array v is distributed onto a different number of processors before
and after redistribution. The array size, communication cost, and computational
cost are determined according to the equations in Section 4.8.1. At each node v,
a set of solutions is formed. Each solution contains the final distribution of v, the
loop nesting at v, the loop fusion between v and its parent, the total communication
and computational cost, and the memory usage for the subtree rooted at v (which
is a mem field for static allocation, or a seqset structure for dynamic allocation). A
solution s is said to be inferior to another solution s' if they have the same final
distribution, s.nesting ⊆ s'.nesting, s.totalcost > s'.totalcost, and the memory usage
of s is inferior to s' (as defined in Sections 3.3 and 3.5). An inferior solution and any
solution that uses more memory than available can be pruned. At the root node of
T, the solution(s) with the lowest total cost is the optimal solution for the entire tree.
This approach is exhaustive. It always finds an optimal solution if there is one.
The algorithm can be easily modified to minimize memory usage as well.
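The inferiority test and the pruning it enables might look as follows. The Solution fields here are simplified stand-ins: in particular, memory usage is collapsed to a single number rather than a mem field or seqset structure, so the comparison is only a sketch of the criterion defined above.

```python
from dataclasses import dataclass

# Simplified dominance test for the second approach's solution sets;
# all field representations are assumptions made for illustration.

@dataclass
class Solution:
    final_dist: tuple     # final distribution of v
    nesting: frozenset    # loop nesting at v
    totalcost: int        # communication + computational cost
    mem: int              # memory usage of the subtree (simplified to a number)

def inferior(s, t):
    """s is inferior to t: same final distribution, no extra fusion
    opportunities, strictly higher cost, and no better memory usage."""
    return (s.final_dist == t.final_dist
            and s.nesting <= t.nesting
            and s.totalcost > t.totalcost
            and s.mem >= t.mem)

def prune(solutions, mem_limit):
    """Drop solutions over the memory limit, then drop dominated ones."""
    feasible = [s for s in solutions if s.mem <= mem_limit]
    return [s for s in feasible
            if not any(inferior(s, t) for t in feasible if t is not s)]

a = Solution(('j', '*'), frozenset({'i'}), totalcost=10, mem=5)
b = Solution(('j', '*'), frozenset({'i', 'j'}), totalcost=8, mem=4)
print(prune([a, b], mem_limit=100) == [b])  # -> True
```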
In comparison, the first approach is more efficient but is not guaranteed to find
a solution while the second approach has such a guarantee but is slower. Hence,
we suggest that the first approach should be applied first. Only if it cannot find a
solution do we use the second approach.
CHAPTER 5
CONCLUSIONS
In this dissertation, we have addressed performance optimization issues for the
parallel execution of a class of nested loops that arise in the context of some computational
physics applications modeling electronic properties of semiconductors and
metals. The computation essentially involves a multi-dimensional summation (or discretized
integral) of a product of a number of array terms. Besides the typical parallel
computing considerations of mapping the data and computations onto the processors,
two other performance aspects also require attention. One aspect is the total
number of arithmetic operations, which can be reduced by judiciously restructuring
the computations through use of the algebraic properties of commutativity, associativity,
and distributivity. The other aspect is memory usage, which is a function of loop
fusion and loop reordering.
We have proved that the problem of finding a sequence of nested loops that computes
a given multi-dimensional summation using a minimum number of arithmetic
operations is NP-complete. An efficient pruning search strategy to determine the optimal
restructuring of the computations has been provided. We have implemented the
operation minimization algorithm and have used it to obtain significant improvement
in operation count for self-energy electronic structure calculations in a tight-binding
scheme.
We have considered the problem of seeking a loop fusion configuration that minimizes
memory usage. Based on a framework that models loop fusions and memory
usage, we have presented algorithms that solve the memory usage minimization problem
under both static and dynamic memory allocation. Several
ways to further reduce memory usage at the cost of a higher number of arithmetic
operations were described.
The optimal mapping of the data and computations to minimize total inter-processor
communication and computational costs has been addressed. We have
proposed a framework for describing the partitioning of arrays among processors
and for analyzing the amount of data movement between processors under a multi-dimensional
processor view. A dynamic programming algorithm for finding an optimal
partitioning of data and operations among processors was given. It was shown
that different parallel algorithms for matrix multiplication correspond to instances
that would automatically be evaluated in the proposed framework. We have also
described two approaches for finding a loop fusion and array distribution configuration
that minimizes communication and computational cost while staying within the
available memory.
In practice, some arrays could be sparse, and there is also an opportunity to reuse
common sub-expressions in computing a multi-dimensional summation. Moreover,
some sub-computations in summations involving exponential functions can be computed
more efficiently with fast Fourier transforms. These three characteristics were
addressed in the dissertation, and generalizations to the algorithms for the operation
minimization problem, the communication minimization problem, and the memory
usage minimization problem were proposed.
5.1 Research Topics for Further Pursuit
Building upon the foundations developed in this dissertation, research topics that
may be pursued in the context of optimizing these multi-dimensional summation
calculations include the following.
5.1.1 Generalization of the Class of Nested-Loop Computations Handled
In the form of multi-dimensional summations considered in this dissertation, loop
bounds are assumed to have constant ranges and arrays are assumed to be directly
indexed by a list of distinct loop indices. Relaxation of these restrictions would facilitate
the optimization of a wider class of nested-loop computations but complicates the
optimization problems at the same time. Consider the following multi-dimensional
summation:

S = Σ_{i=1}^{N} Σ_{j=i+1}^{N} ( A[j, i] × B[2i + j, j − i] )
In this example, the loop bounds of j are affine functions of i and the array subscripts
are no longer distinct single loop indices. Since the iteration space is triangular
instead of rectangular, the number of arithmetic operations is reduced by roughly a factor of two.
Furthermore, if array A is generated one element at a time rather than precomputed,
we only need to generate a diagonal plane of A, which is the only portion of A involved
in forming S. These changes could affect the operation count, the memory usage, and
the amount of communication in evaluating the multi-dimensional summation. Such
effects on the cost models and on the existing optimization algorithms need to be
studied.
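As a concrete rendering of the example summation (with placeholder element functions, since the arrays' contents are not specified in the text), the triangular iteration could be coded as:

```python
# Sketch of S = sum_{i=1..N} sum_{j=i+1..N} A[j,i] * B[2i+j, j-i];
# a and b are placeholder element functions standing in for the arrays.

def S(N, a, b):
    """The affine lower bound j = i+1 makes the iteration space triangular
    (roughly half the work of the full N x N rectangle), and only the
    elements of A actually touched ever need to be generated."""
    total = 0
    for i in range(1, N + 1):
        for j in range(i + 1, N + 1):
            total += a(j, i) * b(2 * i + j, j - i)
    return total

# With unit elements the sum just counts iterations: N*(N-1)/2 of them.
print(S(10, lambda j, i: 1, lambda x, y: 1))  # -> 45
```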
5.1.2 Optimization of Cache Performance
Memory access cost can be reduced through loop transformations such as loop
tiling, loop fusion, and loop reordering. Although considerable research on loop transformations
for locality has been reported in the literature [1, 3, 9, 14, 30, 39, 42, 67],
it generally does not consider the need to perform both loop fusion
and loop tiling together for locality and memory usage. Initial work on this problem
is reported in [13]. For a fully associative cache and a restricted class of expression
trees in which each node is equivalent to a matrix multiplication, an algorithm is
presented for finding the tile sizes, loop fusions, and the partitioning of work among
processors that minimize cache misses while keeping memory usage under a given
limit. Further work on generalizing the class of expression trees handled by
the algorithm is desirable.
5.1.3 Optimization of Disk Access Cost
Another performance metrics to be optimized is disk access cost. Some of the
input arrays may be disk-resident and the output array may need to be written to
disk for further analysis or visualization. When the amount of available memory is
tight, we can allocate a buffer for each input array and fill it with a portion of the array
read from the disk on an on-demand basis. Similarly, elements of the output array
can be stored in a buffer and written to disk when the buffer is full. If the optimal
loop fusion configuration still requires more memory than available, in addition to
trading operations for memory (see Section 3.9), we can also trade disk accesses for
memory. This is done by saving intermediate arrays to disk in chunks after their
production and reading them back from disk before their consumption. The buffer
size for writing an array to disk and reading it back may be different. In general,
the number of disk reads/writes for an array is inversely proportional to the size of
the buffer allocated for the array. Thus, this optimization problem can be stated as
follows: find a fused loop structure with disk read/write statements that implements
a given expression tree with the lowest disk access cost while not using more memory
than a given limit.
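The stated inverse relationship between buffer size and disk accesses can be illustrated with a toy cost count; the function and its arguments are hypothetical, introduced here only to make the trade-off concrete.

```python
from math import ceil

# Toy illustration of the disk-access trade-off: an intermediate array of
# `size` elements written through a buffer of `write_buf` elements costs
# ceil(size / write_buf) writes, and likewise for reads, so the access
# count falls inversely with buffer size. The two buffers may differ.

def disk_accesses(size, write_buf, read_buf):
    """Chunked writes after production plus chunked reads before
    consumption of a disk-resident intermediate array."""
    return ceil(size / write_buf) + ceil(size / read_buf)

print(disk_accesses(10_000, write_buf=500, read_buf=250))  # -> 60
```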
Such fused loop structures with disk read/write statements can be represented by
an extended form of fusion graphs (see Section 3.1). First, we extend an expression
tree by replacing each leaf node with a "read input array" node, adding a "write
output array" node as the parent of the root node, and inserting a "write intermediate
array" node and a "read intermediate array" node between every intermediate array and
its parent. Then, we extend a fusion graph by adding, for each "read" or "write" node,
a set of vertices, one for each dimension of the array. In effect, an extended fusion
graph models disk read/write statements inside loop nests and allows us to consider
different possible fusions of the loops around array evaluation statements as well as
disk read/write statements. An algorithm similar to the memory usage minimization
algorithms (see Sections 3.3 and 3.5) can be applied on an extended fusion graph to
find the loop fusions that result in minimal disk access cost under the memory constraint.
To follow this approach, it is necessary to characterize the relationship between loop
fusions, buffer size, and disk access cost, and to propose changes to the existing algorithms
to handle disk accesses.
5.1.4 Development of an Automatic Code Generator
After implementing the optimization algorithms presented in this dissertation,
a logical next step is to develop an automatic code generator that takes a multi-dimensional
summation as input and synthesizes the source code of a parallel program
that computes the desired multi-dimensional summation and is optimal in arithmetic
operation count and communication cost while not exceeding the available memory.
The automatic code generator would first use the operation minimization algorithm
to obtain an operation-minimal formula sequence, and then use the memory usage minimization
algorithm and the communication minimization algorithm to determine the
optimal loop fusions and the partitioning of the arrays and computations among the
processors. Based on the results returned from the optimization algorithms, together
with information about the target machine architecture and the target language, the
source code of a parallel program that computes the given multi-dimensional summation
can be produced automatically. Some issues that need to be addressed for code
generation include:
• Generating a fused loop structure from loop fusion information. Sections 3.3
and 3.5 have provided sufficient information on how a fused loop structure
corresponding to a fusion graph can be formed. Adding the loop bounds and
the array initialization, allocation, and deallocation statements to a fused loop
structure is quite straightforward.
• Specifying in the source code the distribution of arrays and computation among
processors. This is language dependent. In MPI, for example, this can be
achieved by reading or computing a different portion of an array depending on
the process ID. In some other languages, preprocessor directives or pragmas
may be used to specify how the arrays and loop iterations are partitioned.
• Redistributing arrays. For each array to be redistributed, we first have to determine
the communication pattern in terms of which processors need to send
and/or receive data to/from which other processors and, for each source and
destination processor pair, which part of the local array the source processor
needs to send. Some communication patterns can be classified as one or more
groups of processors performing concurrent broadcast or personalized communication,
which could be one-to-all or all-to-all. To implement array redistributions,
collective communication library routines such as those in MPI or PVM
may be used. Efficient algorithms for array redistribution can be found in
[25, 47, 51, 53, 59, 60, 61, 63, 64].
• Computing fast Fourier transforms (FFTs). FFTs can be implemented by calling
library routines such as FFTW. Section 3.8 describes which portions of a
source array and a result array participate in an FFT function call.
BIBLIOGRAPHY
[1] Anant Agarwal, David Kranz, and Venkat Natarajan. "Automatic partitioning of parallel loops for cache-coherent multiprocessors". In International Conference on Parallel Processing, pages 12-111, St. Charles, IL, August 1993.
[2] Anant Agarwal, David Kranz, and Venkat Natarajan. "Automatic partitioning of parallel loops and data arrays for distributed shared memory multiprocessors". IEEE Transactions on Parallel and Distributed Systems, 6(9):946-962, September 1995.
[3] Jennifer M. Anderson, Saman P. Amarasinghe, and Monica S. Lam. "Data and computation transformations for multiprocessors". In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 166-178, Santa Barbara, CA, July 1995.
[4] Jennifer M. Anderson and Monica S. Lam. "Global optimizations for parallelism and locality on scalable parallel machines". In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 112-125, Albuquerque, NM, June 1993.
[5] Andrew W. Appel and Kenneth J. Supowit. "Generalizations of the Sethi-Ullman algorithm for register allocation". Software: Practice and Experience, 17(6):417-421, June 1987.
[6] W. Aulbur. Parallel implementation of quasiparticle calculations of semiconductors and insulators. PhD thesis, The Ohio State University, October 1996.
[7] Rajeev Barua, David Kranz, and Anant Agarwal. "Communication-minimal partitioning of parallel loops and data arrays for cache-coherent distributed-memory multiprocessors". In Languages and Compilers for Parallel Computing, pages 350-368, San Jose, August 1996.
[8] Steve Carr and Ken Kennedy. Compiler blockability of numerical algorithms. Technical Report CRPC-TR92208-S, Rice University, April 1992.
[9] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. "Compiler optimizations for improving data locality". In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, San Jose, CA, October 1994.
[10] Siddhartha Chatterjee, John R. Gilbert, Robert Schreiber, and Shang-Hua Teng. "Automatic array alignment in data-parallel programs". In 20th Annual ACM SIGACT/SIGPLAN Symposium on Principles of Programming Languages, pages 16-28, New York, January 1993.
[11] Siddhartha Chatterjee, John R. Gilbert, Robert Schreiber, and Shang-Hua Teng. "Optimal evaluation of array expressions on massively parallel machines". ACM TOPLAS, 17(1):123-156, January 1995.
[12] Michal Cierniak, Wei Li, and Mohammed Javeed Zaki. "Loop scheduling for heterogeneity". In Fourth International Symposium on High Performance Distributed Computing, August 1995.
[13] Daniel Cociorva, John Wilkins, Chi-Chung Lam, and P. Sadayappan. "Transformations for parallel execution of a class of nested loops on shared-memory multi-processors". Submitted for publication.
[14] Stephanie Coleman and Kathryn S. McKinley. "Tile size selection using cache organization and data layout". In ACM SIGPLAN Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
[15] Jack Dongarra, Loïc Prylli, Cyril Randriamaro, and Bernard Tourancheau. "Array Redistribution in ScaLAPACK using PVM". In Second European PVM Users Group Meeting, Lyon, France, September 1995.
[16] Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash. "On estimating and enhancing cache effectiveness". In Fourth International Workshop on Languages and Compilers for Parallel Processing, pages 328-343, Santa Clara, CA, August 1991.
[17] C. N. Fischer and R. J. LeBlanc Jr. Crafting a Compiler. Benjamin/Cummings, Menlo Park, CA, 1991.
[18] Guang R. Gao, Russell Olsen, Vivek Sarkar, and Radhika Thekkath. "Collective loop fusion for array contraction". In Languages and Compilers for Parallel Processing, pages 171-181, New Haven, CT, August 1992.
[19] Guang R. Gao, Vivek Sarkar, and Shaohua Han. "Locality analysis for distributed shared-memory multiprocessors". In Languages and Compilers for Parallel Computing, pages 20-40, San Jose, August 1996.
[20] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, 1979.
[21] John R. Gilbert and Robert Schreiber. "Optimal expression evaluation for data parallel architecture". Journal of Parallel and Distributed Computing, 13:58-64, September 1991.
[22] Leonidas J. Guibas and Douglas K. Wyatt. "Compilation and delayed evaluation in APL". In Fifth Annual ACM Symposium on Principles of Programming Languages, pages 1-8, Tucson, Arizona, January 1978.
[23] Manish Gupta and Prithviraj Banerjee. "PARADIGM: A compiler for automatic data distribution on multicomputers". In ACM International Conference on Supercomputing, Tokyo, Japan, July 1993.
[24] M. S. Hybertsen and S. G. Louie. "Electronic correlation in semiconductors and insulators: band gaps and quasiparticle energies". Phys. Rev. B, 34:5390, 1986.
[25] S. D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan. "Multiphase redistribution: A communication-efficient approach to array redistribution". IEEE Transactions on Parallel and Distributed Systems, 1995.
[26] S. D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan. "Multi-phase redistribution: modeling and evaluation". In International Parallel Processing Symposium, pages 441-445, April 1995.
[27] Wayne Kelly and William Pugh. "A Unifying Framework for Iteration Reordering Transformations". In International Conference on Algorithms and Architectures for Parallel Processing, pages 153-162, Brisbane, Australia, April 1995.
[28] Ken Kennedy and Kathryn S. McKinley. "Optimizing for parallelism and data locality". In ACM International Conference on Supercomputing, pages 323-334, Washington, DC, July 1992.
[29] Ken Kennedy and Kathryn S. McKinley. "Maximizing loop parallelism and improving data locality via loop fusion and distribution". In Languages and Compilers for Parallel Computing, pages 301-320, Portland, OR, August 1993.
[30] Dattatraya Kulkarni and Michael Stumm. Loop and data transformations: a tutorial. Technical Report CSRI-337, University of Toronto, June 1993.
[31] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA, 1994.
[32] Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. Memory-optimal evaluation of expression trees involving large objects. Technical Report OSU-CISRC-5/99-TR13, Dept. of Computer and Information Science, The Ohio State University, May 1999.
[33] Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. "Memory-optimal evaluation of expression trees involving large objects". In International Conference on High Performance Computing, Calcutta, India, December 1999.
[34] Chi-Chung Lam, Daniel Cociorva, Gerald Baumgartner, and P. Sadayappan. "Optimization of memory usage and communication requirements for a class of loops implementing multi-dimensional integrals". In Languages and Compilers for Parallel Computing, San Diego, August 1999.
[35] Chi-Chung Lam, P. Sadayappan, Daniel Cociorva, Mebarek Alouani, and John Wilkins. "Performance optimization of a class of loops involving sums of products of sparse arrays". In Ninth SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, TX, March 1999.
[36] Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimal reordering and mapping of a class of nested-loops for parallel execution". In Languages and Compilers for Parallel Computing, pages 315-329, San Jose, August 1996.
[37] Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "On optimizing a class of multi-dimensional loops with reductions for parallel execution". Parallel Processing Letters, 7(2):157-168, 1997.
[38] Chi-Chung Lam, P. Sadayappan, and Rephael Wenger. "Optimization of a class of multi-dimensional integrals on parallel machines". In Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.
[39] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. "The cache performance and optimizations of blocked algorithms". In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, Palo Alto, CA, April 1991.
[40] Wei Li. Compiling for NUMA Parallel Machines. PhD thesis, Cornell University, August 1993.
[41] Wei Li. Compiler optimizations for cache locality and coherence. Technical Report 504, University of Rochester, April 1994.
[42] Wei Li. "Compiler Cache Optimizations for Banded Matrix Problems". In International Conference on Supercomputing, Barcelona, Spain, July 1995.
[43] C. C. Lu and W . C. Chew. "Fast algorithm for solving hybrid integral equations". IEEE Proceedings-H, 140(6):455-460, December 1993.
[44] Naraig Manjikian and Tarek S. Abdelrahman. "Array data layout for the reduction of cache conflicts". In International Conference on Parallel and Distributed Computing Systems, pages 111-118, Orlando, FL, September 1995.
[45] Naraig Manjikian and Tarek S. Abdelrahman. “Fusion of loops for parallelism and locality” . In International Conference on Parallel Processing, pages 11:19-28, Oconomowoc, WI, August 1995.
[46] Naraig Manjikian and Tarek S. Abdelrahman. “Reduction of cache conflicts in loop nests” . Technical Report CSRI-318, University o f Toronto, March 1995.
[47] Philip K. McKinley, Yih-Jia Tsai, and David F. Robinson. A survey of collective communication in wormhole-routed massively parallel computers. Technical Report MSU-CPS-94-35, Michigan State University, June 1994.
[48] Edmund K. Miller. “Solving bigger problems by decreasing the operation count and increasing computation bandwidth”. Proceedings of the IEEE, 79(10):1493-1504, October 1991.
[49] Ikuo Nakata. “On compiling algorithms for arithmetic expressions” . Communications of the Association for Computing Machinery, 10:492-494, 1967.
[50] Miodrag Potkonjak, Mani B. Srivastava, and Anantha P. Chandrakasan. “Multiple constant multiplications: efficient and versatile framework and algorithms for exploring common expression elimination”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(2):151-164, February 1996.
[51] Loïc Prylli and Bernard Tourancheau. Block cyclic array redistribution. Technical Report 95-39, École Normale Supérieure de Lyon, October 1995.
[52] William Pugh. “The Omega test: a fast and practical integer programming algorithm for dependence analysis”. Communications of the ACM, 35(8):102-114, August 1992.
[53] Shankar Ramaswamy and Prithviraj Banerjee. Automatic generation of efficient array redistribution routines for distributed memory multiprocessors. Technical Report UILU-ENG-94-2213, CRHC-94-09, University of Illinois, April 1994.
[54] H. N. Rojas, R. W. Godby, and R. J. Needs. “Space-time method for ab initio calculations of self-energies and dielectric response functions of solids”. Phys. Rev. Lett., 74:1827, 1995.
[55] Loren Schwiebert and D. N. Jayasimha. “Optimal fully adaptive minimal wormhole routing for meshes”. Journal of Parallel and Distributed Computing, 27(1):56-70, May 1995.
[56] Ravi Sethi. “Complete register allocation problems”. SIAM Journal on Computing, 4(3):226-248, September 1975.
[57] Ravi Sethi and J. D. Ullman. “The generation of optimal code for arithmetic expressions”. Journal of the Association for Computing Machinery, 17(4):715-728, October 1970.
[58] Sharad Singhai and Kathryn McKinley. “Loop fusion for data locality and parallelism”. In Mid-Atlantic Student Workshop on Programming Languages and Systems, SUNY at New Paltz, April 1996.
[59] Rajeev Thakur and Alok Choudhary. “All-to-all communication on meshes with wormhole routing”. In International Parallel Processing Symposium, pages 561-565, April 1994.
[60] Rajeev Thakur, Alok Choudhary, and Geoffrey Fox. “Runtime array redistribution in HPF programs”. In Scalable High Performance Computing Conference, pages 309-316, May 1994.
[61] Rajeev Thakur, Alok Choudhary, and J. Ramanujam. “Efficient algorithms for array redistribution”. IEEE Transactions on Parallel and Distributed Systems, 7(6):587-594, June 1996.
[62] Yih-Jia Tsai and Philip K. McKinley. “An extended dominating node approach to collective communication in all-port wormhole-routed 2D meshes”. In Scalable High Performance Computing Conference, pages 199-206, Knoxville, TN, May 1994.
[63] Akiyoshi Wakatani and Michael Wolfe. “A new approach to array redistribution: strip mining redistribution”. In Parallel Architectures and Languages Europe, pages 323-335, July 1994.
[64] Akiyoshi Wakatani and Michael Wolfe. “Optimization of array redistribution for distributed memory multicomputers”. Parallel Computing, 21(9):1485-1490, September 1995.
[65] S. Winograd. Arithmetic Complexity of Computations. Society for Industrial and Applied Mathematics, Philadelphia, 1980.
[66] P. Winter. “Steiner problems in networks: a survey”. Networks, 17:129-167, 1987.
[67] Michael E. Wolf and Monica S. Lam. “A data locality optimization algorithm”. In SIGPLAN ’91 Conference on Programming Language Design and Implementation, pages 30-44, Toronto, Canada, June 1991.
[68] Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[69] Jingling Xue. “Communication-minimal tiling of uniform dependence loops”. In Languages and Compilers for Parallel Computing, pages 330-349, San Jose, August 1996.