


Parallel Computing 18 (1992) 1363-1380, North-Holland

PARCO 734

Parallel-architecture-directed program transformation

David Sharp, Martin Cripps and John Darlington Dept. of Computing, Imperial College of Science, Technology & Medicine, London SW7 2BZ, UK

Abstract

Sharp, D., M. Cripps and J. Darlington, Parallel-architecture-directed program transformation, Parallel Computing 18 (1992) 1363-1380.

The widespread use of parallel machines has been hampered by the difficulty of mapping applications onto them effectively. The difficulty arises because current programming languages require the programmer to specify a problem to be solved at a low level of abstraction in an imperative form. Thus the programmer must immediately encode an architecture-specific algorithm detailing every communication and calculation. This process is prone to error and complicates the reuse of software. An alternative approach is to specify the problem to be solved at a high level in a functional language. Meaning-preserving program transformations can then be used to derive a parallel algorithm. Such algorithms can be run on parallel graph-reduction or dataflow machines which automatically exploit the implicit parallelism in a functional language program. Such automatic decomposition techniques, however, are not yet capable of fully yielding the extra performance offered by the parallel hardware. We show how, by including an architecture specification with the problem specification, and extending the amount of transformation performed, it is possible to produce functional language code that explicitly expresses the calculations and communications to be performed by the processors. This simplifies compilation, yields faster programs and enables parallel software to be developed for a wide variety of parallel computer architectures. A goal-seeking transformation methodology has been developed which enables a high-level functional specification of the problem and a high-level functional abstraction of the target computer architecture to be systematically manipulated to produce an efficient parallel algorithm tailored to the target architecture. As the transformations start from very high-level specifications, the discovery of new algorithms is facilitated. A case study is used to demonstrate the effectiveness of the technique.
We show how a high level specification for sort can be transformed with a pipeline architecture specification to give a mergesort and how the same specification with a dynamic-message-passing architecture specification can be transformed to a novel parallel quicksort.

Keywords. Functional language; program transformation; parallel computer; program synthesis; formal methods.

1. Introduction

We present a methodology for systematically synthesising algorithms for various parallel architectures which works by using an architecture specification to narrow down the choice of possible algorithms that can compute some problem specification on the target architecture. Thus we overcome one of the main problems associated with program synthesis by unfold/fold

Correspondence to: David Sharp, Dept. of Computing, Imperial College of Science, Technology & Medicine, London SW7 2BZ, UK, email: [email protected]

0167-8191/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved



transformation [5], which is that there are too many choices of possible transformation paths at any stage in the synthesis. With an architecture specification present the synthesis is much more focussed on the need to remove redundant computations by introducing interprocessor communication.

Our technique has already been applied to produce parallel algorithms for problems as diverse as dynamic programming, tessellation of the plane, fractal image generation and Fourier Transformation. In this paper we outline the technique and demonstrate it in action to produce two parallel sorting algorithms for two different parallel architectures.

1.1. Architecture classification

We present transformations for two classes of computer architecture: static architectures and dynamic-message-passing architectures. Figures 1 and 2 show functional language representations of these two types of architecture.

dec f1, f2, f3 : list num -> list num; ! Q and i are of type list num; ! (f o g) x = f(g(x));

Q == (f3 o f2 o f1) i;


dec f1, f2, f3, f4 : list num -> list num; dec f5, f6, f7 : list num X list num -> list num; ! Q is of type list num; ! i1, i2, i3, i4 are list num;

Q == f7(x5, x6) where x5 == f5(x1, x2) end where x6 == f6(x3, x4) end where x1 == f1(i1) end where x2 == f2(i2) end where x3 == f3(i3) end where x4 == f4(i4) end;

type FourStreams == list num X list num X list num X list num; dec f11, f12, ..., f32, f33 : FourStreams -> FourStreams; ! T is of type list list FourStreams;

let T == [ [ f11([], [], west(T at (1,2)), north(T at (2,1))),
f12(east(T at (1,1)), [], west(T at (1,3)), north(T at (2,2))),
f13(east(T at (1,2)), [], [], north(T at (2,3))) ],

[ f21([], south(T at (1,1)), west(T at (2,2)), north(T at (3,1))),
f22(east(T at (2,1)), south(T at (1,2)), west(T at (2,3)), north(T at (3,2))),
f23(east(T at (2,2)), south(T at (1,3)), [], north(T at (3,3))) ],

[ f31([], south(T at (2,1)), west(T at (3,2)), []),
f32(east(T at (3,1)), south(T at (2,2)), west(T at (3,3)), []),
f33(east(T at (3,2)), south(T at (2,3)), [], []) ] ] in T;

Fig. 1. Functional representation of static architectures.



Fig. 2. Functional representation of a dynamic message passing architecture. (The figure shows processors attached to a routing network, with initial messages entering and answer messages leaving; newlocalstate and newcontents and destination are calculated from contents and localstate. Initial messages are determined by the problem specification. The algorithm's purpose is to generate the answer messages.)

Static architectures have fixed interprocessor connections and these are represented straightforwardly in a functional language: processors are modelled by functions and interprocessor communication can be modelled by function composition (f o g). The expression (f o g) x, which means f(g(x)), indicates that the processor calculating f takes its input from the processor calculating g(x).
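The composition model can be sketched in Python (a sketch in our own notation, not the paper's functional language; the stage functions f1, f2, f3 are illustrative placeholders): each stage is an ordinary function on lists, and composing the stages is the entire "wiring" of the static architecture.

```python
# Static architecture as function composition: (f3 o f2 o f1)(i) pipes the
# input stream i through the three "processors" in turn.
from functools import reduce

def compose(*fs):
    """Compose right to left: compose(f3, f2, f1)(x) = f3(f2(f1(x)))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fs)

# Three illustrative stage functions of type list num -> list num.
f1 = lambda xs: [x + 1 for x in xs]   # each stage transforms its input stream
f2 = lambda xs: [x * 2 for x in xs]
f3 = lambda xs: sorted(xs)

Q = compose(f3, f2, f1)([3, 1, 2])    # pipeline: f1, then f2, then f3
print(Q)                              # [4, 6, 8]
```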

Dynamic-message-passing architectures have a message routing mechanism which routes messages of the form MSG(destination, contents) to the appropriate destination processor. Thus any processor may send messages to any other processor. These architectures may be represented in a functional language using functions to model the processors and set abstraction to model the message routing. The router may be physically implemented in various ways: in the ALICE machine [4] it is implemented using a delta network switch [8]; in the Thinking Machines Connection Machine CM2 [22] the routing is achieved by hypercube hardware and routing software.
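A toy simulation of this message-routing model can be sketched in Python (the names run, the MSG tuples, and the special "answer" destination are our assumptions, not the paper's API): the router is simply a queue of (destination, contents) pairs, and each processor maps (contents, localstate) to (newlocalstate, outgoing messages), as in Fig. 2.

```python
# Dynamic message passing: a router delivers MSG(destination, contents)
# tuples; processors may send messages to any other processor.
from collections import deque

def run(processors, states, initial_messages):
    """Drain the message pool, dispatching each message to its destination."""
    pool, answers = deque(initial_messages), []
    while pool:
        dest, contents = pool.popleft()
        if dest == "answer":                  # answer messages leave the system
            answers.append(contents)
            continue
        states[dest], out = processors[dest](contents, states[dest])
        pool.extend(out)
    return answers

# Two processors that add their local constant and forward, then answer.
procs = {
    0: lambda c, s: (s, [(1, c + s)]),        # processor 0 forwards to 1
    1: lambda c, s: (s, [("answer", c + s)]), # processor 1 emits an answer
}
print(run(procs, {0: 10, 1: 100}, [(0, 1)]))  # [111]
```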

2. The methodology

A high-level problem specification is combined with a target architecture specification to produce a specification of the problem on the architecture. In the case of a static architecture, we unfold and transform the problem specification until it is in a form where its function structure is isomorphic to that of the static architecture representation. The functions for each processor on the architecture are then abstracted and compiled to machine code.

For a dynamic-message-passing architecture the specification of the problem is cast in terms of what answer messages are required in response to some input messages. Transforma- tions are carried out to remove redundant calculations by introducing additional message- passing. This allows processors that require intermediate results calculated by another processor to receive them in a message rather than recalculate the contents of the message locally. The transformation ends when any need to access global data in the specification has



Fig. 3. Methodology overview. (One of many possible problem specifications, e.g. Fourier Transform, is combined with one of a few architecture specifications, e.g. graph-reduction or dataflow, to give a specification of the problem on the architecture. Transformations then remove redundancies, eliminating local recalculation of values already computed elsewhere, and size-adjust, partially evaluating to a fixed size network and aggregating tasks for processors.)

been transformed out and an efficient algorithm has emerged. An overview of this process is illustrated in Fig. 3.

3. Worked example: sorting

We have chosen to use sorting as a worked example and show how to use the methodology to derive parallel sorting algorithms for pipelined and dynamic-message-passing architectures.

3.1. Sort specification

A sorted list is a permutation of the original list that is in order and preserves the relative ordering of equal elements. Thus the position in the sorted list of element Xj (which is in position j in the unsorted list) is 1 plus the number of elements smaller than Xj, plus the



number of elements equal to Xj that are to the left of it in the unsorted list. We specify sort in terms of the function posn which returns the position of an item (the first argument) in a list (its second argument). The first item of a list is in position 1.

posn(Xj, sort[X1, ..., Xn]) = 1 + #{Xi < Xj | 1 <= i <= n} + #{Xi = Xj | 1 <= i < j}   (1)
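Equation (1) can be executed directly. The sketch below (Python rather than the paper's functional language; the names are ours) computes each element's position from the two counts and places it there, giving a stable enumeration sort.

```python
# Equation (1): position = 1 + (elements smaller anywhere)
#                            + (equal elements strictly to the left).
# The second count is what makes the sort stable.

def posn(j, X):
    """Position (from 1) of X[j] in sort(X); j is 0-based here."""
    less = sum(1 for x in X if x < X[j])
    equal_left = sum(1 for i in range(j) if X[i] == X[j])
    return 1 + less + equal_left

def enum_sort(X):
    out = [None] * len(X)
    for j in range(len(X)):
        out[posn(j, X) - 1] = X[j]   # place each item at its computed position
    return out

print(enum_sort([3, 1, 2, 1]))  # [1, 1, 2, 3]
```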

We wish to transform this specification into a parallel algorithm for a pipeline architecture and into another parallel algorithm for a dynamic-message-passing architecture.

4. Transformation to pipeline architecture

4.1. Specification on pipeline architecture

We wish to pipe the n items to be sorted through the pipeline and for them to emerge at the other end in sorted order, smallest item first. Clearly it will take O(n) time to pipe the elements through so, for a near optimal algorithm we need a pipe of length O(log n). Thus we have the following architecture-specific problem specification:

(fO(log n) o ... o f3 o f2 o f1) X = sort X.

Our objective is to derive the functions f1, ..., fO(log n). Unfolding the specification of sort gives:

(fO(log n) o ... o f2 o f1) X = [ (Xj | posn(Xj, sort X) = 1), ..., (Xj | posn(Xj, sort X) = n) ].

Consider the final element of the pipe, fO(log n). The first thing it does is to produce the smallest item. As we are doing comparison based sorting the smallest item in the list is determined as the result of a comparison between two items. Before this comparison there are two elements that are contenders for smallest and afterwards there is only one. The situation before the smallest-element-determining-comparison is therefore something like the following:

... Xi < Xj < Xk < ...   (1 <= i, j, k <= n, i ≠ j ≠ k)
and
... Xp < Xq < Xr < ...   (1 <= p, q, r <= n, p ≠ q ≠ r)

In this case the smallest element is either Xi or Xp and one comparison will determine the smallest. Suppose Xi was the smallest. The next smallest element is then either Xp or Xj. We are merging two sorted lists! Clearly the pipeline architectural specification of sort suggests that a mergesort is suitable for use with the pipeline architecture.

Consulting our transformation library, we discover that a transformation already exists from a high-level specification of sort to recursive mergesort [3]. So we can reuse this transformation and now simply have to map a recursive mergesort onto a pipeline. This can be done as shown in Fig. 4.

We transform, using a standard type transformation, the function TreeStage, that maps the data at one tree level to the next, to a function PipeStage that maps the data at the input of one pipe stage to its output.

The data at each tree level can be represented as a list (list num); for example [[3,7], [2,8], [1,6], [4,5]] represents the output of the four vertical merges. The data at each pipe stage can be represented as a (list list num X list list num); for example ([[3,7],[1,6]], [[2,8],[4,5]]) represents the output of the first PipeStage function.



Fig. 4. Mapping of tree mergesort to pipelined mergesort. (The figure shows the binary merge tree for the singleton lists [3], [7], [2], [8], [1], [6], [4], [5], merging level by level through [3,7], [2,8], [1,6], [4,5] and then [2,3,7,8], [1,4,5,6] up to [1,2,3,4,5,6,7,8], together with the corresponding pipeline stages.)

The structure of the tree can be represented by defining a function buildtree that connects together layers of the tree defined using TreeStage. (In the following definitions the type-variable alpha will take on the type list num and f will be instantiated to merge).

dec buildtree : (alpha X alpha -> alpha) X list alpha -> alpha;
--- buildtree(f, xs) <= buildtree(f, TreeStage(f, xs));
--- buildtree(f, x::y::[]) <= f(x, y);

dec TreeStage : (alpha X alpha -> alpha) X list alpha -> list alpha;
--- TreeStage(f, xs::ys::rests) <= f(xs, ys)::TreeStage(f, rests);
--- TreeStage(f, [xs]) <= [xs];
--- TreeStage(f, []) <= [];

It can be easily verified that mergesort is equivalent to
--- mergesort(xs) <= buildtree(merge, map(lambda y => [y], xs));

where lambda represents an anonymous function and map is the usual higher-order function that applies a function (the first argument) to each element of a list (the second argument):

--- map(f, []) <= [];
--- map(f, x::rest) <= f(x)::map(f, rest);
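The buildtree/TreeStage/map scheme translates almost line for line into Python. The sketch below uses our own names, with the standard library's stable heapq.merge standing in for the paper's merge, and builds the same level-by-level tree mergesort.

```python
# Tree mergesort: start from singleton lists and repeatedly merge adjacent
# lists level by level (TreeStage) until one sorted list remains (buildtree).
from heapq import merge  # stable two-way merge from the standard library

def tree_stage(f, xss):
    """Combine adjacent elements pairwise; a lone trailing element passes through."""
    if len(xss) >= 2:
        return [f(xss[0], xss[1])] + tree_stage(f, xss[2:])
    return xss  # [xs] or []

def buildtree(f, xss):
    if len(xss) == 1:          # base case: a single (sorted) list remains
        return xss[0]
    return buildtree(f, tree_stage(f, xss))

def mergesort(xs):
    return buildtree(lambda a, b: list(merge(a, b)), [[y] for y in xs])

print(mergesort([5, 3, 8, 1]))  # [1, 3, 5, 8]
```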

A data type transformation can be used to synthesize the function PipeStage that maps the input of one stage of the pipe to its output (Fig. 5).



Fig. 5. Data type transformation to synthesize PipeStage: PipeStage = PipeRep(TreeStage(TreeRep)). (TreeStage maps list list num to list list num; TreeRep and PipeRep convert between that tree representation and the pipe representation list list num X list list num, on which PipeStage operates.)

The function PipeRep converts data in a layer in the tree to the corresponding layer in the pipe and the function TreeRep converts data in a layer in the pipe to that in the corresponding layer in the tree. PipeStage has the same effect as TreeRep followed by TreeStage followed by PipeRep.

In the transformations below PipeStage has been defined so that it takes the function to be performed on the data (i.e. merge in this case) as an extra parameter. This is to illustrate that the transformation will not only work for mergesort but for any divide and conquer algorithm that can be partially evaluated to a tree algorithm in which the amount of work at each level of the tree is constant. The use of higher order functions in this manner will allow libraries of standard transformations to be built up and thus will enable considerable computer assistance with mapping of specifications onto architectures. (Again the type-variable alpha is actually list num.)

dec PipeStage : (alpha X alpha -> alpha) X list alpha X list alpha -> list alpha X list alpha;
--- PipeStage(f, xs, ys) <= PipeRep(TreeStage(f, TreeRep(xs, ys)));

dec PipeRep : list alpha -> list alpha X list alpha;
!PipeRep is the function that converts the tree layer to the corresponding pipe layer.
--- PipeRep(xs) <= (odds xs, evens xs);

dec odds : list alpha -> list alpha;
--- odds x::y::rest <= x::odds rest;
--- odds [x] <= [x];
--- odds [] <= [];

dec evens : list alpha -> list alpha;
--- evens x::y::rest <= y::evens rest;
--- evens [x] <= [];
--- evens [] <= [];

dec TreeRep : list alpha X list alpha -> list alpha;
!TreeRep is PipeRep^-1, i.e. the function which converts the pipe inputs to the corresponding tree inputs.
--- TreeRep(x::xs, y::ys) <= x::y::TreeRep(xs, ys);
--- TreeRep([], []) <= [];
!TreeRep is only meant to work for lists of length 2^n.
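The representation changers can be checked concretely. The Python sketch below (our names; merge replaced by a simple stand-in) implements odds, evens, PipeRep and TreeRep and composes them into PipeStage exactly as in Fig. 5, reproducing the example data given earlier.

```python
# PipeRep splits a tree layer into (odd-position, even-position) halves;
# TreeRep interleaves them back; PipeStage = PipeRep . TreeStage . TreeRep.

def odds(xs):  return xs[0::2]   # elements in positions 1, 3, 5, ...
def evens(xs): return xs[1::2]   # elements in positions 2, 4, 6, ...

def pipe_rep(xs):
    return (odds(xs), evens(xs))

def tree_rep(xs, ys):
    out = []
    for x, y in zip(xs, ys):     # interleave; assumes equal-length halves
        out += [x, y]
    return out

def tree_stage(f, xss):
    if len(xss) >= 2:
        return [f(xss[0], xss[1])] + tree_stage(f, xss[2:])
    return xss

def pipe_stage(f, xs, ys):
    return pipe_rep(tree_stage(f, tree_rep(xs, ys)))

merge2 = lambda a, b: sorted(a + b)   # stand-in for merge on sorted lists
print(pipe_stage(merge2, [[3, 7], [1, 6]], [[2, 8], [4, 5]]))
# ([[2, 3, 7, 8]], [[1, 4, 5, 6]])
```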



Instantiation:
PipeStage(f, x1::x2::xs, y1::y2::ys) <= PipeRep(TreeStage(f, TreeRep(x1::x2::xs, y1::y2::ys)));

Unfold TreeRep:
<= PipeRep(TreeStage(f, x1::y1::x2::y2::TreeRep(xs, ys)));

Unfold TreeStage:
<= PipeRep(f(x1, y1)::f(x2, y2)::TreeStage(f, TreeRep(xs, ys)));

Unfold PipeRep:
<= (f(x1, y1)::rest1, f(x2, y2)::rest2)
   where (rest1, rest2) == PipeRep(TreeStage(f, TreeRep(xs, ys)));

Fold PipeStage:
<= (f(x1, y1)::rest1, f(x2, y2)::rest2)
   where (rest1, rest2) == PipeStage(f, xs, ys);

Thus:
PipeStage(f, x1::x2::xs, y1::y2::ys) <= (f(x1, y1)::rest1, f(x2, y2)::rest2)
   where (rest1, rest2) == PipeStage(f, xs, ys);

This is the function that each of the processors in the pipe needs to run, with f instantiated to merge. It takes O(n) time on O(log n) processors for a list of length n to be sorted and thus the synthesized pipeline mergesort is optimal.
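The derived recursive PipeStage can likewise be written down directly. In the Python sketch below (our names; the base cases for short lists are our addition, following TreeRep's restriction to equal-length halves), each step consumes two items from each input stream, as in the synthesized equation.

```python
# Synthesized recursive PipeStage: per step, pair off two items from each
# input stream and recurse on the remainders.

def pipe_stage_rec(f, xs, ys):
    if not xs:
        return ([], [])
    if len(xs) == 1:                 # last pair: the result lands in the odd half
        return ([f(xs[0], ys[0])], [])
    rest1, rest2 = pipe_stage_rec(f, xs[2:], ys[2:])
    return ([f(xs[0], ys[0])] + rest1, [f(xs[1], ys[1])] + rest2)

merge2 = lambda a, b: sorted(a + b)  # stand-in for merge on sorted lists
print(pipe_stage_rec(merge2, [[3, 7], [1, 6]], [[2, 8], [4, 5]]))
# ([[2, 3, 7, 8]], [[1, 4, 5, 6]])
```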

The synthesis did not exploit any particular properties of mergesort and is a general divide and conquer algorithm to pipeline transformation, providing the divide and conquer tree contains equal amounts of work at each level of the tree.

5. Transformation to dynamic-message-passing architecture

Transformations to a dynamic-message-passing architecture are achieved by reasoning about the set of messages passed between the processors. The main transformation tools are free-message-instantiation for introducing new messages and message-folding which enables a value to be used from an incoming message instead of recomputing it locally. The transformation is achieved by introducing rules which state which messages arise in response to which other messages. The first rule states what initial messages start the calculation off and what answer messages must be produced in response.

5.1. Specification on dynamic-message-passing architecture

We start the sort with one record per processor and sort the records with respect to the enumeration of the processors. Thus the smallest item is moved to processor a (the lowest numbered processor carrying out the sort), the next smallest item to processor a + 1 and so on up to the largest item which is sent to processor a + n - 1.

Thus processor a + j , which has record Xj to begin with, wishes to calculate posn(Xj, sort X) and send its record to processor a + posn(Xj, sort X) - 1.



We can send a continue sort message MSG(a + j, CS(Xj, a, a - 1 + length(X), i, X)), containing one of the n items Xj to be sorted (and other parameters to enable us to write a workable specification), to each processor a + j (0 <= j <= n - 1) and in response to it, the processor must send out an answer message MSG(a + posn(Xj, sort[X1, ..., Xn]) - 1, ANS(Xj)) to the processor that needs to receive Xj at the end of the sort.

Rule 1. For all i >= 1 (i is an integer used to disambiguate messages from different recursive calls to sort)

MSG(a + j, CS(Xj, a, a - 1 + length(X), i, X)) ∈ messages   (0 <= j <= length X - 1)
=> MSG(a + posn(Xj, sort X) - 1, ANS(Xj)) ∈ messages

where posn returns the position of an item in a list (numbered from 1) and messages denotes the set of all messages that exist in the evaluation of quicksort.

This rule, which expresses a property of the messages, operationally implies that when processor a + j receives a message MSG(a + j, CS(Xj, a, b, i, X)) it is responsible for ensuring that a message MSG(a + posn(Xj, sort X) - 1, ANS(Xj)) is produced. This is because only processor a + j is aware of the existence of messages whose destination is a + j, and thus it must make rule 1 hold. Rule 1 has a base case, which is when only a single item is being sorted. In this case j = 0, a = b and posn(Xj, sort X) = 1.

5.2. Architecture-specific specification

We can use rule 1 to specialise a dynamic-message-passing architecture (DMPA) specification to sort a list X = [X1, ..., Xn] to give a list Y = [Y1, ..., Yn] on processors numbered k = a to b (b = a + n - 1). Processor k receives Xk initially and receives Yk at the end of the algorithm.

DMPASort(X, a, b) = [Y1, ..., Yn]
where MSG(k, ANS(Yk)) ∈ messages
messages = {MSG(a + j, CS(Xj, a, b, 1, X)) | 0 <= j <= n - 1}
         ∪ {Pk(filter(k, messages), 1) | a <= k <= b}
filter(k, ms) = {m ∈ ms | m = MSG(k, _)}

Pk(MessagesIn, i) =
  let {MSG(a + j, CS(Xj, a, b, i, X))} ∪ OtherMessagesIn = MessagesIn in
  if (a = b) then {MSG(k, ANS(Xj))} ∪ Pk(OtherMessagesIn, i + 1)
  else FreeMessagesOut ∪ {MSG(destination, ANS(Xj))} ∪ Pk(OtherMessagesIn, i + 1)
       where destination = a + posn(Xj, sort X) - 1

The last parameter (i) of Pk is equal to the number of the iteration of the current call to Pk and is incremented on each new recursive call to Pk. It is used to disambiguate messages from different recursive calls. The initial program contains the free variable FreeMessagesOut in the message stream emerging from Pk; any value of FreeMessagesOut that is consistent with rule 1 provides a correct specification for sort; for example an empty set. To prove that this specification satisfies rule 1, messages can be instantiated to {MSG(a + j, CS(Xj, a, a - 1 + length(X), 1, X)) | 1 <= j <= length X} and the program code



can be unfolded until MSG(a + posn(Xj, sort X) - 1, ANS(Xj)) (1 <= j <= length X) appears in the messages as well.

The function Pk relies on access to the whole of the list to be sorted (X) and this is initially present as the last parameter of the CS message sent to processor k but X will be removed during the transformation.

Moreover, each processor individually calculates the position of its item in the final list and sends out a corresponding answer message. Clearly this is not a very efficient parallel algorithm as each processor duplicates all of the sorting work. The aim of the transformation process is to remove this redundancy, replacing it by inter-processor communication whereby useful results computed by one processor are transmitted to the processor that needs them.

6. Transformation

Consider processor a + j. It is trying to move its record Xj to a processor which is higher numbered than the destination processors of all items less than Xj, and all items equal to Xj which were originally on a lower numbered processor than Xj. Processor a + j can use EQN 1 (the sort specification in terms of position) to calculate the processor to which its item should be sent.

There are two summations and n comparisons in EQN 1 and since the records are distributed across the processors, interprocessor communication is required to carry them out. In order to carry out the comparisons, the value of the item on processor a + j is required by all processors k, a _< k _< b. This item can be broadcast to all processors by processor a + j in O(1) time if a broadcast mechanism is available (as it is, for example, on the Connection Machine) or in O(log n) time using a tree connection of processors.

Comparisons of processor a + j's item with everyone else's take O(1) time in parallel and the summations can be done in O(log n) by employing a binary tree connection of the processors. Processor a + j can then send its item to its final destination.

The other processors could carry out similar calculations to those of processor a + j and then send their items to the appropriate destinations. (This would carry out an enumeration sort with many duplicated calculations.) Alternatively the information gathered by processor a + j can be reused by the other processors. Each processor knows whether its item is less than, equal to or greater than the item on processor a + j, Xj. Clearly all items greater than Xj must be sent to a higher numbered processor than Xj's destination and items less than Xj must be sent to a lower numbered processor than Xj's destination.

Without further calculation, the processors do not know exactly which processor to send their items to, but suppose, as an initial approximation, the items were just sent to an appropriately numbered processor (compared with the destination processor for Xj) preserving their original ordering. On processors a to a + #{Xi < Xj | 1 <= i <= n} - 1 the original conditions of a sort would have been recreated for the items less than Xj. Performing these two sorts and moving items equal to Xj into the remaining processors preserving their original ordering would clearly be the basis for a parallel quicksort with Xj as the pivot element (Fig. 6).
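The regrouping step can be sketched in Python (our names; this shows the data movement only, not the message passing): a stable three-way split around the pivot recreates the conditions of a sort on each contiguous band of processors.

```python
# Stable three-way partition around the pivot: items keep their original
# relative order inside each band, which is what lets each band be sorted
# recursively as an independent subproblem.

def regroup(items, pivot):
    less    = [x for x in items if x < pivot]    # go to the low processors
    equal   = [x for x in items if x == pivot]   # go to the middle processors
    greater = [x for x in items if x > pivot]    # go to the high processors
    return less + equal + greater                # original order kept per band

print(regroup(list("bgeacdf"), "d"))  # ['b', 'a', 'c', 'd', 'g', 'e', 'f']
```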

Once again we can use a library transformation from a specification of sort to recursive quicksort [3] and tailor this to the dynamic-message-passing architecture by introducing messages to determine and broadcast the pivot, to work out the sum of the items less than, equal to and greater than the pivot and to arrange for items to be regrouped on appropriate processors. The regrouped items could then be sorted using the same algorithm.

Extra message passing is introduced by instantiating the free variable FreeMessagesOut in the specification of Pk. For example, suppose the set of messages already contains the



Fig. 6. Parallel quicksort. (d is chosen as pivot; the items less than the pivot, b, a and c, move in their original order to the low-numbered processors, d to the middle processor, and the items greater than the pivot to the high-numbered processors.)

message M1 = MSG(dest1, contents1). To introduce some message M2 = MSG(dest2, contents2) into the set of messages, a new rule is added which states that M1 ∈ messages => M2 ∈ messages. Processor dest1 (which received M1) is then charged with ensuring that M2 appears. This is achieved by instantiating Pdest1's free variable FreeMessagesOut to {M2} ∪ FreeMessagesOut2. Processor dest2 will then receive contents2 in a message. If contents2 appears as a sub-expression in the body of Pdest2, for example in a where clause, its calculation can be replaced by the contents of the message. The message is extracted from MessagesIn using a let-clause.

For example, in the body of the message recipient, we can replace the expression E(x) where x == y + z with let {MSG(k, ValueOfX(x))} ∪ rest == MessagesIn in E(x), providing we have introduced the message ValueOfX by instantiating FreeMessagesOut of some other processor. In this way a novel parallel quicksort can be formally synthesized. For the full mathematical synthesis the reader is referred to another paper [18]. The operation of the synthesized algorithm is as follows. Consider sorting the first seven letters of the alphabet on seven processors numbered 1 to 7 initially organised as a depth first numbered tree (Fig. 7).

The first stage of the algorithm requires the processors to agree on a pivot element which they will all use. The scheme in Fig. 8 can be used to find a pivot near the mean of the elements.

Figure 8 shows how each processor (except the leaf processors) receives a triple from its children containing the best pivot so far, the sum of the elements so far, and the number of elements so far. (For the purposes of summing the values of the items to be sorted, the letters of the alphabet have been assigned values as follows: a = 1, b = 2, c = 3, ..., g = 7.)

Fig. 7. Depth-first numbered tree with items to be sorted.



Fig. 8. Calculate pivot. (The leaves send (a,1,1), (c,3,1), (d,4,1), (f,6,1); the processors holding e and b form (c,9,3) and (d,12,3); the root forms (d,28,7).)

Fig. 9. Broadcast pivot.

The processor adjusts the sum and number of elements so far so that its own element is included, and sends these to its parent together with the new best pivot so far. The new best pivot so far is the one of the three elements known to the processor that is nearest to the mean so far. For example processor two, which contains e, chooses c as the best pivot to send to processor one, because c is closer in value to 9/3 = 3 than a or e. The pivot can be broadcast down the tree in O(log n) time (Fig. 9). The processors each compare their item with the pivot and, as shown in Fig. 10, produce a triple: (1, 0, 0) if the item is less than the pivot, (0, 1, 0) if the item is equal to the pivot, and (0, 0, 1) if the item is greater than the pivot.
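The upward pass of Fig. 8 can be sketched as follows. This is a sequential Python model (an illustration, not the synthesized functional code); the helper name `pivot_triple` is my own, and the letter values a = 1, ..., g = 7 follow the worked example:

```python
# Each processor merges its children's triples (best pivot, sum, count),
# includes its own item, and picks as the new best pivot the element it
# knows about that is nearest the running mean.
vals = {c: i + 1 for i, c in enumerate("abcdefg")}  # a = 1, ..., g = 7

def pivot_triple(item, child_triples):
    candidates = [item]
    total, count = vals[item], 1
    for pivot, s, n in child_triples:
        candidates.append(pivot)
        total += s
        count += n
    mean = total / count
    best = min(candidates, key=lambda c: abs(vals[c] - mean))
    return (best, total, count)

# The depth-first numbered tree of Fig. 7: g at the root, e and b internal,
# a, c, d, f at the leaves.
left = pivot_triple("e", [pivot_triple("a", []), pivot_triple("c", [])])
right = pivot_triple("b", [pivot_triple("d", []), pivot_triple("f", [])])
root = pivot_triple("g", [left, right])

assert left == ("c", 9, 3)     # processor two forwards c, as in the text
assert right == ("d", 12, 3)
assert root == ("d", 28, 7)    # d is chosen as the pivot
```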

Fig. 10. Comparison with pivot (pivot is d).

Fig. 11. Add the triples. (The subtrees contribute (2,0,1) and (1,1,1); the root obtains (3,1,3).)



Fig. 12. Determine destination messages.

We can add up the numbers of items less than, equal to, and greater than the pivot in O(log n) time (Fig. 11). The answer (3, 1, 3) indicates that three items are less than the pivot and will be sent to processors 1 to 3, that one item is equal to the pivot and will be sent to processor 4 and that three items are greater than the pivot and will be sorted on processors 5 to 7.

Let #L be the number of items less than the pivot, #E be the number of items equal to the pivot and #G be the number of items greater than the pivot. Consider some processor a + j: if its item is less than the pivot then the destination for the item is a plus the number of items on processors a to a + j - 1 that are less than the pivot. If its item is greater than the pivot then the destination is #L + #E + the number of items greater than the pivot on processors lower numbered than j. If its item is equal to the pivot then the destination is #L + the number of items equal to the pivot on processors lower numbered than j.
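The destination rule can be checked with a small sequential sketch (Python, with the group taken to start at processor a = 1 as in the worked example; the helper names are my own, not the paper's):

```python
# Destination of each item after one partitioning round, pivot = d.
vals = {c: i + 1 for i, c in enumerate("abcdefg")}
items = {1: "g", 2: "e", 3: "a", 4: "c", 5: "b", 6: "d", 7: "f"}
p = vals["d"]                                        # pivot value

nL = sum(1 for x in items.values() if vals[x] < p)   # items < pivot
nE = sum(1 for x in items.values() if vals[x] == p)  # items = pivot

def destination(j):
    v = vals[items[j]]
    lower = [vals[items[i]] for i in items if i < j]  # lower-numbered procs
    if v < p:                    # destinations 1 .. #L
        return 1 + sum(1 for w in lower if w < p)
    if v == p:                   # destinations #L+1 .. #L+#E
        return nL + 1 + sum(1 for w in lower if w == p)
    return nL + nE + 1 + sum(1 for w in lower if w > p)

dests = {items[j]: destination(j) for j in items}
assert dests == {"g": 5, "e": 6, "a": 1, "c": 2, "b": 3, "d": 4, "f": 7}
```

The item g on the root processor goes to processor 5, and the items smaller than the pivot land on processors 1 to 3 with the pivot itself on processor 4, agreeing with the worked example.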

The root processor (in this case processor 1, the one with g on it) sends the g to the lowest numbered available processor sorting items greater than the pivot (processor 5). It then informs its left child that the lowest numbered available processor destinations for items less than, equal to and greater than the pivot are (1, 4, 6) respectively. The root processor knows that its left subtree contains (2, 0, 1) items less than, equal to and greater than the pivot respectively, and that its own item is greater than the pivot and thus uses one destination processor for items greater than the pivot. It therefore informs its right child that (1, 4, 6) + (2, 0, 1) = (3, 4, 7) are the destination processor numbers available to the right child. Each child uses the same technique to inform its children of the available processors, as shown in Fig. 12.

In Fig. 12 each processor (except the root) receives a triple from its parent, adds its pivot comparison triple (from Fig. 10) and sends the resulting triple to its left child. It then adds the triple received from its left child in Fig. 11 and sends the new result to its right child. It does not matter that some of the processor numbers go out of range because they won't be used. For example (4, 5, 7) is sent to the processor which has f on it, and it does not matter that the 5 is out of range because only the 7 will be used. The sort continues separately for the items less than and greater than the pivot (Fig. 13).

Fig. 13. Continue sort using new trees.



The sort ends when the number of processors in each new sort group is one. At this point processor a + i will contain the ith smallest element because the algorithm continually sends smaller items to lower numbered processors.
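The overall effect of the recursion can be sketched sequentially (Python; this collapses the O(log n) parallel rounds into recursive calls and, as a simplification, picks the pivot as the item globally nearest the group mean rather than the tree-restricted approximation of Fig. 8):

```python
# Sequential model of the synthesized parallel quicksort: choose a pivot
# near the mean, regroup so that smaller items move to lower-numbered
# processors, and recurse until every group holds a single processor.
vals = {c: i + 1 for i, c in enumerate("abcdefg")}

def parallel_quicksort(group):
    # `group` lists the items held by one group of processors, in
    # processor-number order.
    if len(group) <= 1:
        return group
    mean = sum(vals[x] for x in group) / len(group)
    pivot = min(group, key=lambda x: abs(vals[x] - mean))
    less = [x for x in group if vals[x] < vals[pivot]]
    equal = [x for x in group if vals[x] == vals[pivot]]
    greater = [x for x in group if vals[x] > vals[pivot]]
    return parallel_quicksort(less) + equal + parallel_quicksort(greater)

# Processors 1..7 initially hold g, e, a, c, b, d, f (the Fig. 7 layout):
assert parallel_quicksort(list("geacbdf")) == list("abcdefg")
```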

Machine             Algorithm           Execution time             No. of procs.     Memory per proc.
------------------------------------------------------------------------------------------------------
SIMD linear array   Merge-split         O((n log n)/p) + O(n)      p                 O(n/p)
                    Odd-even            O(n)                       n                 O(1)
                    transposition
                    Enumeration         O(n)                       n                 O(1)
Pipeline            Merge               O(n)                       1 + log n         O(n)
Tree (n = 2^m)      Bucket sort         O(n)                       (2 log n) - 1     n/m at leaves,
                    with merge                                                       n/2^i at level
                                                                                     0 <= i < log m
                                                                                     (root at level 0)
                    Median find         O(n)                       n                 as above
                    and split
Mesh                Bitonic             O((n/p) log(n/p))          p <= (log n)^2    O(n/p)
Perfect shuffle     Merge-split         O((n/s) log(n/s))          s/2               O(n/s)
network                                 + O((n/s) log^2 s)
  (n = 2^m)         Bitonic             O((log n)^2)               n/2               O(1) + n stores
SIMD + special      Odd-even merge      O((log n)^2)               O(n (log n)^2)    O(1)
network             Bitonic merge       O((log n)^2)               O(n (log n)^2)    O(1)
MIMD or SIMD        Parallel quicksort  O((log n)^2)               n                 O(1)
Hypercube SIMD      Enumeration         O(log n)                   n^2               O(1)
Shared memory       Bucket              log n                      n                 O(nm), elements in
                    [Hirschberg 78]                                                  range 1..m-1
                    Enumeration         O((1/a) log n)             n^(1+a)           O(n), 0 < a <= 1
                    [Preparata 78]
------------------------------------------------------------------------------------------------------

Fig. 14. Performance of some parallel sorting algorithms. (n = no. of items to sort, p = no. of processors, s = no. of storage modules. If n = 2^m is shown then the algorithm only works when n is a power of two.)



7. Performance of the algorithms


Figure 14 shows the performance of the synthesized sorts in relation to other sorting algorithms.

None of the algorithms in the table (Fig. 14) manages to sort n items in O(log n) time on n processors using a reasonable amount of memory. The hypercube enumeration sort requires n^2 processors, the Hirschberg bucket sort requires large amounts of memory, and the shared memory enumeration sort requires more than n processors.

Despite the success of other parallel sorting algorithms [1,2,10,11,12,15], previous attempts to parallelize quicksort [14,7,6] have not achieved a speedup greater than six [9]. This is because they did not use a communication-intensive strategy and thus were crippled by the overheads of fine-grain dynamic task allocation and load balancing.

Martel and Gusfield have suggested a way of running quicksort in O(log n) time on n processors [13], but they require a shared memory machine which allows concurrent reading and writing of the same memory location by all the processors (with write conflicts resolved by random choice). Such an architecture is currently impractical. Thus the O((log n)^2) quicksort described here and related algorithms [16] have the lowest complexity and promise the fastest execution speed for parallel quicksort on commercially available parallel hardware.

The quicksort has been executed on 8192 processors of a Thinking Machines Connection Machine CM2 [22] and gives excellent results in comparison with enumeration sort and bubble sort (Fig. 15). The theoretical complexity of our parallel quicksort is time = k(log n)^p where p is 2. Taking logs we have log time = log k + p log(log n). Thus plotting log time against log log n should yield a straight line of slope p. Figure 16 shows that, for the Connection Machine, p ≈ 2.7. This is in good agreement with the theoretical prediction: the hypercube hardware and routing software of the Connection Machine has added an extra O((log n)^0.7) factor onto the execution time of the O((log n)^2) algorithm.
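The slope estimate behind Fig. 16 can be reproduced numerically. This sketch uses synthetic timings with p = 2 (not the Connection Machine measurements) to show that a least-squares fit of ln time against ln ln n recovers p:

```python
import math

# Synthetic timings obeying time = k * (log n)^p exactly, with p = 2.
k, p_true = 0.5, 2.0
ns = [100, 1000, 10000, 100000]
times = [k * math.log(n) ** p_true for n in ns]

# Taking logs: ln(time) = ln(k) + p * ln(ln(n)), a line of slope p.
xs = [math.log(math.log(n)) for n in ns]
ys = [math.log(t) for t in times]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))

assert abs(slope - p_true) < 1e-9   # the fitted slope recovers p
```

On measured data the fitted slope exceeds the theoretical 2, which is how the extra routing overhead of the Connection Machine shows up.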

Fig. 15. Quicksort, bubble sort and enumeration sort on the Connection Machine. (No. of items = no. of processors used.)



Fig. 16. Ln time against ln ln no. of items; the slope of this graph is p in O((log n)^p).

Fig. 17. Transformation path followed. (Sort specification + pipeline architecture specification → library transform → mergesort → transform → mergesort on tree → transform → mergesort on pipe. Sort specification + dynamic-message-passing architecture specification → library transform → quicksort → partially evaluate → parallel quicksort with redundant calculations → define messages and rules and transform → parallel quicksort on message-passing architecture.)



8. Conclusion

We have presented a systematic technique for synthesising fast algorithms for static and dynamic-message-passing parallel computer architectures and illustrated the technique by generating a pipelined mergesort and a novel message-passing parallel quicksort (Fig. 17).

The pipelined mergesort is optimal, and the message-passing quicksort proved to be more scalable and faster than other comparison-based sorts, including enumeration sort and bubble sort, when run on an 8192-processor Connection Machine CM2. The synthesis technique has been applied to the generation of various other algorithms [17] including dynamic programming [19], Fourier transformation [20], tessellation of the plane and fractal image generation [21].

It is a feature of this architecture-directed transformation mechanism that both a library of machine architecture specifications and a library of transformations are built up and can be reused. It should therefore come as no surprise that new general transformation tactics and new algorithms are derived through its use.

Acknowledgements

We wish to thank members of the Functional Programming group at Imperial College for their suggestions, Thorn EMI Central Research Laboratories and the Science and Engineering Research Council of Great Britain for their financial support, and Ralf Zimmer of GMD FIP, Bonn, for his help with running the parallel quicksort on the Connection Machine.

References

[1] S.G. Akl, Parallel Sorting Algorithms (Academic Press, New York, 1985).
[2] D. Bitton, D. DeWitt, D. Hsiao and J. Menon, A taxonomy of parallel sorting algorithms, Computing Surveys 16 (3) (1984) 287.
[3] K.L. Clark and J. Darlington, Algorithm classification through synthesis, Computer J. 23 (1) (1980) 61-65.
[4] M.D. Cripps, J. Darlington, A.J. Field, P.G. Harrison and M. Reeve, The design and implementation of ALICE: A parallel graph reduction machine, in: S.S. Thakkar, ed., Selected Reprints on Dataflow and Reduction Architectures (IEEE Computer Society Press, Silver Spring, MD, 1987).
[5] J. Darlington and R.M. Burstall, A system which automatically improves programs, Acta Inform. 6 (1976) 41-60.
[6] J. Deminet, Experience with multiprocessor algorithms, IEEE Trans. Comput. C-31 (Apr. 1982) 278-288.
[7] D.J. Evans and Y. Yousif, Analysis of the performance of the parallel quicksort method, BIT 25 (1985) 106-112.
[8] A.J. Field and M.D. Cripps, Self-clocking networks, Proc. IEEE Internat. Conf. Parallel Processing (ICPP), Chicago (Aug. 1985) 384-387.
[9] R.S. Francis and I.D. Mathieson, A benchmark parallel sort for shared memory multiprocessors, IEEE Trans. Comput. 37 (12) (Dec. 1988) 1619-1626.
[10] A. Gibbons and W. Rytter, Efficient Parallel Algorithms (Cambridge University Press, Cambridge, 1988).
[11] D.S. Hirschberg, Fast parallel sorting algorithms, Commun. ACM 21 (8) (Aug. 1978) 657-666.
[12] S. Lakshmivarahan, S.K. Dhall and L.L. Miller, Parallel sorting algorithms, Advances in Computers, Vol. 13 (Academic Press, New York, 1984) 295-354.
[13] C.U. Martel and D. Gusfield, A fast parallel quicksort algorithm, Inform. Processing Letters (Jan. 1989) 97-102.
[14] P. Moller-Nielsen and J. Staunstrup, Problem-heap: A paradigm for multiprocessor algorithms, Parallel Comput. 4 (1987) 63-74.
[15] F.P. Preparata, New parallel sorting schemes, IEEE Trans. Comput. C-27 (7) (July 1978) 669-673.
[16] D.W.N. Sharp and M.D. Cripps, A parallel implementation strategy for quicksort, Proc. 1989 Internat. Symp. on Computer Architecture and Digital Signal Processing, Vol. 1, Hong Kong (Oct. 1989) 305-309.
[17] D.W.N. Sharp, Functional language program transformation for parallel computer architectures, Ph.D. thesis, Dept. of Computing, Imperial College, London Univ., 1990.
[18] D.W.N. Sharp, P.G. Harrison and J. Darlington, A synthesis of a dynamic message-passing algorithm for quicksort, Internal Report DoC 91/19, Imperial College, May 1991.
[19] D.W.N. Sharp, H. Khoshnevisan and A.J. Field, A case study in the synthesis of parallel functional programs for message-passing architectures, Internal Report DoC 91/18, Imperial College, May 1991.
[20] D.W.N. Sharp and M.D. Cripps, Synthesis of the Fast Fourier Transform algorithm by functional language program transformation, Internal Report DoC 91/17, Imperial College, May 1991.
[21] D.W.N. Sharp and M.D. Cripps, Parallel algorithms that solve problems by communication, Proc. Third IEEE Symp. on Parallel and Distributed Processing, Dallas, TX (Dec. 1991).
[22] Connection Machine Model CM-2 Technical Summary, Thinking Machines Corporation Technical Report Series HA87-4, 1987.