IEEE Comput. Soc. Press, 11th International Parallel Processing Symposium, Geneva, Switzerland (1-5...

Parallel Solutions of Indexed Recurrence Equations

Yosi Ben-Asher Dep. of Math. and CS.

Haifa University, 31905 Haifa, Israel

[email protected]

Abstract

A new type of recurrence equations called "indexed recurrences" (IR) is defined, in which the common notion of X[i] = op(X[i], X[i-1]), i = 1...n, is generalized to X[g(i)] = op(X[f(i)], X[h(i)]), where f, g, h : {1...n} → {1...m}. This enables us to model sequential loops of the form

for i = 1 to n do begin
X[g(i)] := op(X[f(i)], X[h(i)]);

as IR equations. Thus, a parallel algorithm that solves a set of IR equations is in fact a way to transform sequential loops into parallel ones. Note that the circuit evaluation problem (CVE) can also be expressed as a set of IR equations. Therefore an efficient parallel solution to the general IR problem is not likely to be found, as such a solution would also solve the CVE, showing that P ⊆ NC. In this paper we introduce parallel algorithms for two variants of the IR equations problem:

• An O(log n) greedy algorithm for solving IR equations where g(i) is distinct and h(i) = g(i), using O(n) processors.

• An $O(\log^2 n)$ algorithm with no restriction on f, g or h, using up to $O(n^2)$ processors. However, we show that for general IR, op must be commutative so that a parallel computation can be used.

1. Introduction

We consider a certain generalization of ordinary recurrence equations called indexed recurrence (IR) equations. Given an initialized array A[1..m], a set of n IR equations has the form A[g(i)] := op(A[f(i)], A[h(i)]), which can be represented by a sequential loop of the form

for i = 1 to n do begin
A[g(i)] := op(A[f(i)], A[h(i)]);

Gadi Haber IBM Science and Technology

31905 Haifa, Israel

haber@haifasc3.vnet.ibm.com

where op is a binary associative operator and where f, g, h : {1..n} → {1..m} do not include references to elements of the array A itself.

The goal is to use the parallel solutions of these IR equations in order to parallelize sequential loops whose execution can be simulated by a set of IR equations. This is similar to the way that parallel solutions of linear recurrences (A[i] = op(A[i-1], A[i])) are used to parallelize sequential loops [2] of the form:

for i = 1 to n do begin
A[i] := op(A[i-1], A[i]);
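As an illustration of how such a linear recurrence can be evaluated in logarithmically many parallel steps, here is a minimal sketch of the standard scan-by-doubling technique. It is not taken from the paper or from [2]; the function names are hypothetical, lists are 0-based, and each "parallel step" is simulated by a list comprehension over a snapshot of the previous round.

```python
from operator import add


def sequential_recurrence(a, op):
    """Reference loop: A[i] := op(A[i-1], A[i]) for every i after the first."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = op(out[i - 1], out[i])
    return out


def scan_by_doubling(a, op):
    """Same result in about log2(n) synchronous rounds.

    op only needs to be associative; operands are always combined
    left-to-right, so a non-commutative op (e.g. string concatenation)
    still gives the sequential answer.
    """
    cur = list(a)
    step = 1
    while step < len(cur):
        prev = cur  # snapshot that every simulated processor reads from
        cur = [prev[i] if i < step else op(prev[i - step], prev[i])
               for i in range(len(prev))]
        step *= 2
    return cur


if __name__ == "__main__":
    print(scan_by_doubling([1, 2, 3, 4, 5], add))
    print(scan_by_doubling(list("abcd"), add))
```

The doubling of `step` is what bounds the number of rounds; the indexed recurrences discussed next need a pointer-based variant because the dependence between iterations is no longer "previous index".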

In our work, we analyzed the well-known Livermore Loops [1] and checked how many of them fit into the general frame of IR equations compared with ordinary recurrence equations. There are 24 loops in this code, which is often used as a benchmark for parallelizing compilers and contains typical code for scientific computing. Out of the 24 loops we found that: loops 1, 7, 8, 12, 15, 16, 22 do not contain recurrences of any type; loops 3, 5, 11, 19 contain linear recurrences; all other loops (except for 13, 20, 24) contain indexed recurrences.

2. Ordinary Indexed Recurrences.

This section describes the parallel algorithm for computing a set of IR equations where g(i) is distinct and h(i) = g(i). This case is simpler than the general one, and the parallel algorithm we obtain is more efficient than the one for the general case, and uses O(n) processors. It is easiest to begin with the sequential algorithm, namely the following loop:

Array A[1..m] with initial values;
for i = 1 to n do
A[g(i)] := A[f(i)] ⊗ A[g(i)];

For convenience, we have replaced the notation op(x, y) with x ⊗ y, where ⊗ is the suitable binary and associative operation. Note that op is not necessarily a commutative operation; therefore our algorithm should preserve the order of the "multiplications" (i.e. the order of the ⊗ operations).

1063-7133/97 $10.00 © 1997 IEEE


for i = 1 to n do A[2i] := A[i+1] · A[2i];

After 8 iterations:

A'[1] = A[1]    A'[2] = A[2] · A[2]    A'[3] = A[3]    A'[4] = A[3] · A[4]
A'[5] = A[5]    A'[6] = A[3] · A[4] · A[6]    A'[7] = A[7]    A'[8] = A[5] · A[8]

Figure 1. An example of an Ordinary IR loop.
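The values in fig. 1 can be reproduced by running the sequential loop on symbolic data. The following is a small sketch under my own assumptions: the array is a 1-based dictionary of 16 cells (so that 8 iterations are well defined), the initial value of cell x is the string "ax", and ⊗ is modeled by string concatenation with a dot, which is associative but not commutative.

```python
def ordinary_ir_sequential(a, n, f, g, op):
    """Run A[g(i)] := op(A[f(i)], A[g(i)]) for i = 1..n on a copy of a."""
    out = dict(a)
    for i in range(1, n + 1):
        out[g(i)] = op(out[f(i)], out[g(i)])
    return out


def figure1_example():
    """The loop of fig. 1: g(i) = 2i, f(i) = i + 1, eight iterations."""
    initial = {x: f"a{x}" for x in range(1, 17)}
    return ordinary_ir_sequential(
        initial, 8,
        lambda i: i + 1,             # f(i)
        lambda i: 2 * i,             # g(i)
        lambda x, y: f"{x}.{y}",     # stand-in for the operator
    )


if __name__ == "__main__":
    result = figure1_example()
    for x in range(1, 9):
        print(f"A'[{x}] = {result[x]}")
```

Because concatenation keeps the operand order visible, the printed strings show each trace exactly as it is multiplied.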

The above loop can be viewed as a function for computing a new value A' = F(A, n, f, g) (also denoted as OrdinaryIR(A, f, g, ⊗)) where A is the initial array and A' is the array after executing the loop. We therefore need to find a parallel algorithm that computes F(A, n, f, g) in fewer than n steps. This is analogous to the way in which prefix-sum [3] is used to solve ordinary recurrence equations: F(A, op(x, y)) = prefix-sum(A, op(x, y)). The value of A'[g(i)] is a product of a subset of elements in A. As an example consider the loop in fig. 1 where in every iteration i, A[2i] is updated by A[i+1] · A[2i]. Some of the elements A'[i] preserve their initial values, e.g. A'[7] = A[7] (since there is no 1 ≤ i ≤ n such that g(i) = 7), while the "trace" of other A'[i] contains the multiplication of several elements, e.g. A'[6] = A[3] · A[4] · A[6] (since g(3) = 6, f(3) = 4, and then g(2) = f(3) and f(2) = 3). Finally, A[3] is the last item in the trace of A'[6] since there is no 1 ≤ i < 2 such that g(i) = f(2). The sequence of multiplications of every element in A' (also called the trace of A[g(i)]) is given by the following lemma:

Lemma 2.1 Let A'[i] denote the value of A[i] after the execution of the loop

for i = 1, ..., n do A[g(i)] = A[f(i)] ⊗ A[g(i)]

then for all i = 1...n

$$A'[g(i)] = A[f(j_k)] \otimes \cdots \otimes A[f(j_1)] \otimes A[g(i)]$$

such that:

• $j_1 = i$.

• for t = 2...k the indices $j_t$ satisfy $j_t < j_{t-1}$ and $g(j_t) = f(j_{t-1})$.

• $j_k$ is the last index for which $g(j_t) = f(j_{t-1})$, i.e., there is no $1 \le j_{k+1} < j_k$ such that $g(j_{k+1}) = f(j_k)$.

Lemma 2.1 suggests a simple method for computing A'[g(i)] in parallel. Let $A^{-t}[g(i)]$ denote the sub-trace with t + 1 rightmost elements in the trace of A'[g(i)], i.e.,

$$A^{-t}[g(i)] = A[f(j_{k-t})] \otimes \cdots \otimes A[f(i)] \otimes A[g(i)]$$

Consider the "concatenation" (or "multiplication") of two "successive" sub-traces:

$$A^{-(t_1+t_2)}[g(i)] = A^{-t_1}[g(j)] \otimes A^{-t_2}[g(i)]$$

where $g(j) = f(j_{k-t_2})$ and $A[f(j_{k-t_2})]$ is the last element in $A^{-t_2}[g(i)]$. Note that A[g(j)] is multiplied twice, once as A[g(j)] and once as $A[f(j_{k-t_2})]$. This can be corrected by taking the trace of its "predecessor" $A^{-t_1}[f(j)]$ so that

$$A^{-(t_1+t_2)}[g(i)] = A^{-t_1}[f(j)] \otimes A^{-t_2}[g(i)]$$

$$= A[f(j_{k_1-t_1})] \otimes \cdots \otimes A[f(j)] \otimes A[f(j_{k_2-t_2})] \otimes \cdots \otimes A[g(i)]$$

$$= A[f(j_{k_1-t_1})] \otimes \cdots \otimes A[g(j')] \otimes A[g(j)] \otimes \cdots \otimes A[g(i)]$$

where j' < j and j' is the iteration number in which A[f(j)] is last updated in the loop.

The proposed algorithm is a simple greedy algorithm that keeps iterating until all traces are completed, where in each iteration all possible concatenations of successive sub-traces are computed in parallel. Thus, initially, we can compute in parallel the first product of each trace A[g(i)] = A[f(i)] ⊗ A[g(i)] (for all i = 1, ..., n). The concatenation of two successive sub-traces $A^{-t_1}[g(j)]$, $A^{-t_2}[g(i)]$ can be implemented using the following:

• the value of a sub-trace $A^{-t}[g(i)]$ is stored in its array element A[g(i)].

• a pointer N[g(i)] points to the sub-trace $A^{-t_1}[g(j)]$ to be concatenated to $A^{-t_2}[g(i)]$ (to form $A^{-(t_1+t_2)}[g(i)]$). Hence, A[N[g(i)]] contains the value of the sub-trace $A^{-t_1}[g(j)]$.

Initially all traces are of length 2, and can be computed in parallel. The code for a concatenation step of future iterations is therefore as follows:

multiplication - $A[g(i)] = A^{-(t_1+t_2)}[g(i)] = A[N[g(i)]] \otimes A[g(i)]$

pointer updating - $N^{-(t_1+t_2)}[g(i)] = N[N[g(i)]]$, where N[1, ..., m] is initialized as follows:

1. N[x] = f(i) if there exists i, 1 ≤ i ≤ n, such that g(i) = x; N[x] = 0 otherwise.

2. Since we start with traces of length 2, then for each i = 1..n, N[g(i)] = N[N[g(i)]].
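The multiplication and pointer-updating steps can be sketched as a pointer-jumping loop. This is only an illustration under stated assumptions, not the code of the full paper: the sketch keeps one partial product and one next-pointer per iteration i (`val[i]`, `nxt[i]`) rather than per array cell, which is my own simplification so that a later write to A[f(i)] is never mistaken for a predecessor; each round of the `while` loop stands for one parallel step; g is assumed distinct.

```python
def ordinary_ir_sequential(a, n, f, g, op):
    """Reference loop: A[g(i)] := op(A[f(i)], A[g(i)]) for i = 1..n."""
    out = dict(a)
    for i in range(1, n + 1):
        out[g(i)] = op(out[f(i)], out[g(i)])
    return out


def ordinary_ir_doubling(a, n, f, g, op):
    """Pointer-jumping sketch; returns (final array, number of rounds)."""
    writer = {g(i): i for i in range(1, n + 1)}  # iteration that assigns each cell
    assert len(writer) == n, "g must be distinct"
    val, nxt = {}, {}
    for i in range(1, n + 1):
        j = writer.get(f(i))
        if j is not None and j < i:
            # A[f(i)] was produced by an earlier iteration j: link to it.
            val[i], nxt[i] = a[g(i)], j
        else:
            # A[f(i)] still holds its initial value: the trace ends here.
            val[i], nxt[i] = op(a[f(i)], a[g(i)]), None
    rounds = 0
    while any(p is not None for p in nxt.values()):
        old_val, old_nxt = val, nxt  # snapshot read by all "processors"
        # multiplication: prepend the joined sub-trace, keeping operand order
        val = {i: old_val[i] if old_nxt[i] is None
               else op(old_val[old_nxt[i]], old_val[i]) for i in old_val}
        # pointer updating: take over the next-pointer of the joined sub-trace
        nxt = {i: None if old_nxt[i] is None else old_nxt[old_nxt[i]]
               for i in old_nxt}
        rounds += 1
    out = dict(a)
    for i in range(1, n + 1):
        out[g(i)] = val[i]
    return out, rounds


if __name__ == "__main__":
    cells = {x: f"a{x}" for x in range(1, 17)}
    res, r = ordinary_ir_doubling(cells, 8, lambda i: i + 1, lambda i: 2 * i,
                                  lambda x, y: f"{x}.{y}")
    print(res[6], r)
```

A trace whose pointer is already `None` is complete and is left untouched, so nothing redundant is appended to it; the number of rounds is governed by the longest chain of predecessors, which at most doubles in covered length per round.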

The way in which the concatenation operation works is depicted in fig. 2, showing two parallel concatenations of sub-traces. The operation $N^{-(t_1+t_2)}[g(i)] = N[N[g(i)]]$ is depicted by the fact that the next-pointer of a new trace is taken to be that of the joined trace. The algorithm performs log n iterations. In each iteration, the above concatenation operation (the multiplication followed by the pointer updating) is performed in parallel for all traces A'[g(i)]. As a result, in each iteration, either a trace is fully computed or the number of elements in the product of a trace is doubled due to the multiplication

$$A^{-(t_1+t_2)}[g(i)] = A^{-t_1}[g(j)] \otimes A^{-t_2}[g(i)].$$

Hence, log n iterations are sufficient.

Figure 2. The concatenation operation of two traces.

Clearly, once a trace has been "completed" (fully computed) we must not continue to concatenate any more traces to it. It therefore remains to determine when the computation of a trace has completed. In general, in every iteration and for every trace stored in A[g(i)], the algorithm must determine:

1. the existence of A[g(j)] such that its trace can be concatenated to the trace of A[g(i)].

2. if the computation of the trace of A[g(i)] is completed, then no more redundant traces should be added to it.

A more efficient version of the algorithm, which forks only up to P processes at the same time, was programmed and tested on the SimParC [5] simulator. The complexity of this version is $T(n, P) = \frac{n}{P} \log n$. Figure 3 shows the results obtained for an array of size n = 50,000 and for P = #processors << n. The Y axis represents the complexity in units of assembly instructions.

The algorithm's code is given in the full paper.

Figure 3. The results of running the OrdinaryIR algorithm for n = 50,000.

2.1. Useful Application for the Ordinary IR Solution.

Consider the following recurrence:

for i = 1 to n do
$$X[g(i)] := \frac{a_i \cdot X[f(i)] + b_i}{c_i \cdot X[f(i)] + d_i}$$

This is not an ordinary IR recurrence due to the non-associative nature of the operators $f_i(x) = \frac{a_i x + b_i}{c_i x + d_i}$ (where i = 1, 2, ..., n). However, we can transform the recurrence into an ordinary IR problem by exploiting a useful quality of these operators, as shown in the following lemma.

Lemma 2.2 Let there be two sets of functions $f_i(x)$ and $g_i(x)$ defined as follows: $f_i(x) = \frac{a_i x + b_i}{c_i x + d_i}$, $g_i(x) = \frac{e_i x + f_i}{g_i x + h_i}$; then $f_i(g_i(x)) = \frac{k_i x + l_i}{m_i x + n_i}$, where

$$\begin{pmatrix} k_i & l_i \\ m_i & n_i \end{pmatrix} = \begin{pmatrix} a_i & b_i \\ c_i & d_i \end{pmatrix} * \begin{pmatrix} e_i & f_i \\ g_i & h_i \end{pmatrix}.$$

The '*' operation is defined as follows for 2×2 matrices:

$$A * B = \begin{cases} A & \text{if } \det(A) = 0 \\ A \cdot B & \text{otherwise} \end{cases}$$

From lemma 2.2, also known as the Moebius transformation, it follows that the values X[g(1)] ... X[g(n)] of the recurrence shown above can be computed by the following steps:

1. Initialize all matrices with appropriate coefficients:

M initialized to [2×2 matrix illegible in the source]

forall i ∈ {1..n} do in parallel $M_{g(i)} := \begin{pmatrix} a_i & b_i \\ c_i & d_i \end{pmatrix}$

2. Multiply the matrices:

for i = 1 to n do $M_{g(i)} := M_{f(i)} * M_{g(i)}$

3. Calculate the values of X[g(1)] ... X[g(n)]:

forall i ∈ {1..n} do in parallel $X[g(i)] := \frac{m_{i,1} \cdot S[g(i)] + m_{i,2}}{m_{i,3} \cdot S[g(i)] + m_{i,4}}$

Note that since step 2 is an ordinary IR, we can replace it with a call to OrdinaryIR(M, f, g, *) (where * is the modified matrix multiplication operation from lemma 2.2). Thus we have transformed the recurrence into an ordinary IR problem, which we already know how to solve. We can also produce a parallel solution to a slightly more complicated recurrence of the following form:

X[1..m] initialized to S[1..m]
for i = 1 to n do
$$X[g(i)] := X[g(i)] + \frac{A[i] \cdot X[f(i)] + B[i]}{C[i] \cdot X[f(i)] + D[i]}$$

Since g(i) is distinct, we can rewrite the above recurrence by replacing the variable X[g(i)] on the right-hand side of the ':=' sign with its initial value S[g(i)], without affecting the final values of X[g(1)] ... X[g(n)]. This is allowed since the distinctness of g(i) guarantees that each assignment to X[g(i)] is the first one, and therefore each reference to X[g(i)] is a reference to its initial value. Thus we can bring the loop to its Moebius form as follows:

X[1..m] initialized to S[1..m]
for i = 1 to n do
$$X[g(i)] := \frac{(S[g(i)] \cdot C[i] + A[i]) \cdot X[f(i)] + (S[g(i)] \cdot D[i] + B[i])}{C[i] \cdot X[f(i)] + D[i]}$$


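producing the following Moebius matrices:

$$M_{g(i)} = \begin{pmatrix} S[g(i)] \cdot C[i] + A[i] & S[g(i)] \cdot D[i] + B[i] \\ C[i] & D[i] \end{pmatrix}$$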

As an example consider the recurrence taken from loop number 23 of the Livermore Loops benchmark [1]. The loop is a 2-D Implicit Hydrodynamics fragment:

X[1..n, 1..7] initialized to S
for j = 2 to 6 do
for i = 2 to n do
X[i,j] := X[i,j] + 0.175 · (Y[i] + X[i-1,j] · Z[i,j]);

The inner loop can be viewed as an ordinary IR problem OrdinaryIR(M, f, g, *) where g(i) = 7(i-1) + j, f(i) = 7(i-2) + j,

$$M_{g(i)} = \begin{pmatrix} 0.175 \cdot Z[i,j] & S[g(i)] + 0.175 \cdot Y[i] \\ 0 & 1 \end{pmatrix} \quad \forall i$$

and where * is the operator from lemma 2.2. Thus, without using any data dependence analysis techniques, we managed to parallelize the loop, to be calculated in O(log n) steps.
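To make the matrix view concrete, the sketch below checks two things for one column j of the loop: that composing two Moebius maps matches the product of their 2×2 matrices, and that the inner loop agrees with an evaluation through accumulated matrix products. It uses the ordinary matrix product rather than the modified * operator, exact `Fraction` arithmetic, 0-based lists in which index 0 plays the role of row i = 1, and hypothetical function names; the running product is accumulated sequentially here, whereas the point of the section is that an associative product of this kind can be handed to the parallel algorithm.

```python
from fractions import Fraction

K = Fraction(175, 1000)  # the constant 0.175 of the loop, kept exact


def mat_mul(p, q):
    """Ordinary product of two 2x2 matrices given as nested tuples."""
    (a, b), (c, d) = p
    (e, f), (g, h) = q
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def mobius_apply(m, x):
    """Evaluate x -> (a*x + b) / (c*x + d) for m = ((a, b), (c, d))."""
    (a, b), (c, d) = m
    return (a * x + b) / (c * x + d)


def inner_loop_sequential(x, y, z):
    """X[i] := X[i] + 0.175 * (Y[i] + X[i-1] * Z[i]) along one column."""
    out = list(x)
    for i in range(1, len(out)):
        out[i] = out[i] + K * (y[i] + out[i - 1] * z[i])
    return out


def inner_loop_by_matrices(x, y, z):
    """Same values via products of the matrices M_i applied to X[0]."""
    out = [x[0]]
    acc = ((1, 0), (0, 1))  # identity: nothing composed yet
    for i in range(1, len(x)):
        m_i = ((K * z[i], x[i] + K * y[i]), (0, 1))
        acc = mat_mul(m_i, acc)  # apply step i after all earlier steps
        out.append(mobius_apply(acc, x[0]))
    return out


if __name__ == "__main__":
    xs = [Fraction(v) for v in (3, 1, 4, 1, 5)]
    ys = [Fraction(v) for v in (2, 7, 1, 8, 2)]
    zs = [Fraction(v) for v in (1, 6, 1, 8, 0)]
    print(inner_loop_by_matrices(xs, ys, zs) == inner_loop_sequential(xs, ys, zs))
```

Each matrix is built from the initial value x[i] only, which mirrors the replacement of X[g(i)] by S[g(i)] on the right-hand side.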

3. General Indexed Recurrences.

We now consider a more general case of IR equations (called GIR) which can be modeled by the loop:

for i = 1 to n do begin
A[g(i)] := A[f(i)] ⊗ A[h(i)];

The greedy method used for the IR case (where g(i) = h(i)) is not suitable for GIR. Essentially, this is due to the difference in the structure of the trace A'[g(i)] in the two cases. As depicted in fig. 4, A'[g(i)] in the GIR case is a binary tree, whereas in the IR case A'[g(i)] is a list.

GIR: A[i] = A[i-1] * A[i-2], with g(i) = i, f(i) = i-1, h(i) = i-2. IR: A[i] = A[i-1] * A[i], with g(i) = i, f(i) = i-1.

Figure 4. Tree structure versus list structure of the trace.

The tree structure of the trace implies that the ⊗ operator must be a commutative one. Clearly, the multiplication of traces' values can be done either from the left or from the right end of a current trace value. The other problem that a GIR loop presents us with is that traces can have an exponential length. For example consider the loop 'for i = 2..n A[i] := A[i-1] ⊗ A[i-2]', where A[0] = A[1] = u. In this example the trace $A'[n] = u^{2^n}$ consists of $2^n$ multiplications. Therefore, in order for the parallelization of GIR loops to be efficient, the computation of a power ($A[i]^k$) must be regarded as an atomic operation. This assumption can also be found in previous works (e.g. [4]) where the multiplication operation was used in order to solve recurrences of additions. The GIR algorithm must therefore gather all identical elements of a trace and then, using the power operation, compute their product in a single operation. As an example, consider the above loop (A[i] := A[i-1] · A[i-2]), where A[0] and A[1] have different initial values. After the execution of the loop the trace is a multiplication of two powers $A'[i] = A[0]^{fb(i-2)} \cdot A[1]^{fb(i-1)}$, where fb(i) is the i'th Fibonacci number. This trace is thus best computed by first counting the powers of A[0] and A[1] in every trace separately (see figure 5). Indeed counting powers is sufficient to compute the traces not only for the above loop, but for any GIR loop as well.

Figure 5. The expansion of the recurrence $X_i = X_{i-1} \otimes X_{i-2}$ for n = 4.
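The idea of counting powers first and applying the power operation once can be sketched for this particular loop as follows. The choice of ⊗ as multiplication modulo a prime is my own assumption for illustration (it is commutative and associative), and the function names are hypothetical. With the convention fb(0) = fb(1) = 1, the counts produced below agree with the exponents fb(i-2) and fb(i-1) stated in the text.

```python
from collections import Counter


def gir_exponents(n):
    """Exponents of A[0] and A[1] in A'[i] for A[i] := A[i-1] (x) A[i-2]."""
    exps = {0: Counter({0: 1}), 1: Counter({1: 1})}
    for i in range(2, n + 1):
        # The trace of A[i] is the union of the traces of its two operands,
        # so the exponent vectors simply add up.
        exps[i] = exps[i - 1] + exps[i - 2]
    return exps


def evaluate(exponents, a0, a1, mod):
    """Two power operations and one multiplication instead of a long trace."""
    return pow(a0, exponents[0], mod) * pow(a1, exponents[1], mod) % mod


def run_loop(n, a0, a1, mod):
    """Reference: execute the loop directly with modular multiplication."""
    a = [a0, a1]
    for i in range(2, n + 1):
        a.append(a[i - 1] * a[i - 2] % mod)
    return a


if __name__ == "__main__":
    e = gir_exponents(4)[4]
    print(e[0], e[1])
    print(evaluate(e, 2, 3, 1_000_003) == run_loop(4, 2, 3, 1_000_003)[4])
```

The exponent vectors grow like Fibonacci numbers, which is why treating a power as a single operation matters.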

Counting all powers of A's elements can be done using an initial "dependence" graph G = <V, E>, showing dependences among the final values of A's elements. The proposed algorithm computes the power of some element A[j] in a trace A'[i] by counting the number of different paths between the corresponding nodes j ∈ V and i ∈ V in G. Intuitively, each edge <i, j> ∈ E of the dependence graph G indicates that A[j] is an operand in the assignment statement to A[i] of the GIR-loop. Thus, the power of A[j] in the trace of A'[i] is in fact the number of different paths leading from j to i in G. Computing all powers in every trace is therefore equivalent to counting all paths (CAP) between the nodes of G. The particular variant of CAP needed for GIR-loops is defined as follows:

Definition 3.1 Let S ⊂ V be the set of nodes with in-degree 0 (the "leaves" or bottom nodes) of a DAG G = <V, E>. Counting all the paths, CAP(G), is an operation that returns a labeled graph G' = <V, E'> such that an edge $<i, j>^{[x]}$, i ∈ V − S, j ∈ S, with the label [x] belongs to G' iff there are exactly x paths from j to i in G.

For example let G be a double chain of n nodes $v_1 \Rightarrow v_2 \Rightarrow \cdots \Rightarrow v_n$, such that there are two edges from $v_i$ to $v_{i+1}$. In this case G' = CAP(G) is a DAG such that there is a single edge from $v_1$ to every $v_i$ of the form $<v_i, v_1>^{[2^{i-1}]}$.

In order to solve a GIR loop we first create the dependence graph G, and then count all the paths in G in parallel, G' = CAP(G). G' is constructed such that an edge $<i, j>^{[x]} \in E'$ iff the power of A[j] in the trace A'[i] is exactly x. Finally, the trace of every element A'[i] is obtained by computing $A'[i] = A[j_1]^{x_1} \otimes \cdots \otimes A[j_k]^{x_k}$ where $<i, j_l>^{[x_l]} \in CAP(G)$, l = 1, ..., k. Thus, once we have the powers $x_1, \ldots, x_k$ the trace can be computed in parallel in log k steps.

The dependence graph G = <V, E> induced by a GIR-loop is defined as follows:

V = {g(1), ..., g(n), f(1)', ..., f(n)', h(1)'', ..., h(n)''} where f(i)' (or h(i)'') represent initial values of A that will form the trace of the g(i) nodes. The edges in E include, for i = 1..n:

$<g(i), f(i)>^{[1]}$ if there exists j, j < i, such that g(j) = f(i)

$<g(i), h(i)>^{[1]}$ if there exists j, j < i, such that g(j) = h(i)

$<g(i), f(i)'>^{[1]}$ if there is no j, j < i, such that g(j) = f(i)

$<g(i), h(i)''>^{[1]}$ if there is no j, j < i, such that g(j) = h(i)

For example, the dependence graph of the loop A[i] = A[i-1] ⊗ A[i-2] is given in fig. 6.

Figure 6. The dependence graph produced by the recurrence $A_i = A_{i-1} \otimes A_{i-2}$ for i = 2, 3, 4.

Our algorithm for computing CAP(G) uses log n iterations (t = 2, ..., log n), where in each iteration we update the edges of the current graph $G_{t-1} = <V, E_{t-1}>$ to form $G_t = <V, E_t>$ as follows:

1 - $E_t = E_{t-1}$

2 - Paths multiplication - For each $<v_i, v_k>^{[x]} \in E_t$ and a successive edge $<v_k, v_j>^{[y]} \in E_t$, we add a new edge $<v_i, v_j>^{[x \cdot y]}$ to $E_t$ and mark $<v_k, v_j>^{[y]}$ to be deleted:

Figure 7. Paths multiplication.

3 - Deleting marked edges - remove each marked edge from $E_t$. This step prevents us from recounting edges that were already taken into consideration in previous steps.

4 - Paths addition - For each node $v_i \in V$ replace all double edges $<v_i, v_j>^{[x_1]}, \ldots, <v_i, v_j>^{[x_k]} \in E_t$ with a single edge (labeled by their sum) $<v_i, v_j>^{[\sum_{l=1}^{k} x_l]}$:

Figure 8. Summing double edges.

Two separate examples of the above algorithm's operation are given in fig. 9. The new edges added (by path multiplication and path addition) in every iteration are denoted by dashed lines.

Figure 9. Iterations of two graphs.
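One possible reading of the multiply, delete and add steps is sketched below for a GIR loop with distinct g. Several points are my own assumptions rather than the paper's code: nodes are identified by iteration number i (standing for g(i)) or by an initial cell ("init", x); in each round the expanded edge <v_i, v_k> is the one that is dropped after being replaced by edges to the targets of v_k; rounds are simulated sequentially; and ⊗ is taken to be multiplication modulo a prime so that powers can be applied with `pow`.

```python
from collections import defaultdict


def dependence_graph(n, f, g, h):
    """edges[i] maps each operand node of iteration i to an edge label."""
    writer = {g(i): i for i in range(1, n + 1)}
    assert len(writer) == n, "this sketch assumes g is distinct"
    edges = {}
    for i in range(1, n + 1):
        out = defaultdict(int)
        for cell in (f(i), h(i)):
            j = writer.get(cell)
            if j is not None and j < i:
                out[("it", j)] += 1        # operand produced by iteration j
            else:
                out[("init", cell)] += 1   # operand is an initial value
        edges[i] = dict(out)
    return edges


def count_all_paths(edges):
    """Repeat multiply/delete/add until every edge ends at an initial value."""
    rounds = 0
    while any(kind == "it" for out in edges.values() for (kind, _) in out):
        old, new = edges, {}
        for i, out in old.items():         # conceptually parallel over nodes
            acc = defaultdict(int)
            for (kind, k), x in out.items():
                if kind == "init":
                    acc[(kind, k)] += x
                else:
                    for target, y in old[k].items():
                        acc[target] += x * y   # multiply labels, sum duplicates
            new[i] = dict(acc)
        edges = new
        rounds += 1
    return edges, rounds


def gir_sequential(a, n, f, g, h, mod):
    out = dict(a)
    for i in range(1, n + 1):
        out[g(i)] = out[f(i)] * out[h(i)] % mod
    return out


def gir_by_counting(a, n, f, g, h, mod):
    labels, _ = count_all_paths(dependence_graph(n, f, g, h))
    out = dict(a)
    for i in range(1, n + 1):
        value = 1
        for (_, cell), power in labels[i].items():
            value = value * pow(a[cell], power, mod) % mod
        out[g(i)] = value
    return out


if __name__ == "__main__":
    f, g, h = (lambda i: i + 1), (lambda i: i + 2), (lambda i: i)
    labels, rounds = count_all_paths(dependence_graph(3, f, g, h))
    print(labels[3], rounds)
```

The example at the bottom is the Fibonacci-like loop shifted to 1-based cells (A[i+2] := A[i+1] ⊗ A[i]); each round replaces every edge by the two-step paths it stands for, so the remaining path length roughly halves per round.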

The full algorithm along with a version which avoids spawning unnecessary processes, and a method for handling GIR with non-distinct g, are described in the full paper.

References

[1] John T. Feo, "An analysis of the computational and parallel complexity of the Livermore Loops", Journal of Parallel Computing, No. 7, 1988, pp. 163-185.

[2] H. S. Stone, "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations", J. ACM 20, 27 (1973).

[3] J. JaJa, "An Introduction to Parallel Algorithms", Addison-Wesley Publishing Company, 1992.

[4] P. M. Kogge, H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, C-22(8):786-793, August 1973.

[5] G. Haber, Y. Ben-Asher, "On the Usage of Simulators to Detect Inefficiency of Parallel Programs Caused by "Bad" Schedulings: the SimParC Approach", accepted for publication in the Journal of Systems and Software.
