
Models and Languages for Parallel Computation

DAVID B. SKILLICORN
Queen's University

DOMENICO TALIA
Università della Calabria

We survey parallel programming models and languages using six criteria to assess their suitability for realistic portable parallel programming. We argue that an ideal model should be easy to program, should have a software development methodology, should be architecture-independent, should be easy to understand, should guarantee performance, and should provide accurate information about the cost of programs. These criteria reflect our belief that developments in parallelism must be driven by a parallel software industry based on portability and efficiency. We consider programming models in six categories, depending on the level of abstraction they provide. Those that are very abstract conceal even the presence of parallelism at the software level. Such models make software easy to build and port, but efficient and predictable performance is usually hard to achieve. At the other end of the spectrum, low-level models make all of the messy issues of parallel programming explicit (how many threads, how to place them, how to express communication, and how to schedule communication), so that software is hard to build and not very portable, but is usually efficient. Most recent models are near the center of this spectrum, exploring the best tradeoffs between expressiveness and performance. A few models have achieved both abstractness and efficiency. Both kinds of models raise the possibility of parallelism as part of the mainstream of computing.

Categories and Subject Descriptors: C.4 [Performance of Systems]; D.1 [Programming Techniques]; D.3.2 [Programming Languages]: Language Classifications

General Terms: Languages, Performance, Theory

Additional Key Words and Phrases: General-purpose parallel computation, logic programming languages, object-oriented languages, parallel programming languages, parallel programming models, software development methods, taxonomy

Authors' addresses: D. B. Skillicorn, Computing and Information Science, Queen's University, Kingston K7L 3N6, Canada; email: <[email protected]>; D. Talia, ISI-CNR, c/o DEIS, Università della Calabria, 87036 Rende (CS), Italy; email: <[email protected]>.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 1998 ACM 0360-0300/98/0600–0123 $05.00
ACM Computing Surveys, Vol. 30, No. 2, June 1998

INTRODUCTION

Parallel computing is about 20 years old, with roots that can be traced back to the CDC 6600 and IBM 360/91. In the years since then, parallel computing has permitted complex problems to be solved and high-performance applications to be implemented both in traditional areas, such as science and engineering, and in new application areas, such as artificial intelligence and finance. Despite some successes and a promising beginning, parallel computing has not become a major methodology in computer science, and parallel computers represent only a small percentage of the computers sold over the years. Parallel computing creates a radical shift in perspective, so it is perhaps not surprising that it has yet to become a central part of practical applications of computing. Given that opinion over the past 20 years has oscillated between wild optimism (“whatever the question, parallelism is the answer”) and extreme pessimism (“parallelism is a declining niche market”), it is perhaps a good time to examine the state of parallel computing. We have chosen to do this by an examination of parallel programming models. Doing so addresses both software and development issues, and hardware and performance issues.

We begin by discussing reasons why parallel computing is a good idea, and suggest why it has failed to become as important and central as it might have been. In Section 2, we review some basic aspects of parallel computers and software. In Section 3, we discuss the concept of a programming model and list some properties that we believe models of parallel programming ought to have if they are to be useful for software development and also for effective implementation. In Section 4, we assess a wide spectrum of existing parallel programming models, classifying them by how well they meet the requirements we have suggested.

Here are some reasons why parallelism has been a topic of interest.

—The real world is inherently parallel, so it is natural and straightforward to express computations about the real world in a parallel way, or at least in a way that does not preclude parallelism. Writing a sequential program often involves imposing an order on actions that are independent and could be executed concurrently. The particular order in which they are placed is arbitrary and becomes a barrier to understanding the program, since the places where the order is significant are obscured by those where it is not. Arbitrary sequencing also makes compiling more difficult, since it is much harder for the compiler to infer which code movements are safe. The nature of the real world also often suggests the right level of abstraction at which to design a computation.

—Parallelism makes available more computational performance than is available in any single processor, although getting this performance from parallel computers is not straightforward. There will always be applications that are computationally bounded in science (the grand challenge problems), and in engineering (weather forecasting). There are also new application areas where large amounts of computation can be put to profitable use, such as data mining (extracting consumer spending patterns from credit card data) and optimization (just-in-time retail delivery).

—There are limits to sequential computing performance that arise from fundamental physical limits such as the speed of light. It is always hard to tell how close to such limits we are. At present, the cost of developing faster silicon and gallium arsenide processors is growing much faster than their performance and, for the first time, performance increases are being obtained by internal use of parallelism (superscalar processors), although at a small scale. So it is tempting to predict that performance limits for single processors are near. However, optical processors could provide another large jump in computational performance within a few decades, and applications of quantum effects to processors may provide another large jump over a longer time period.

—Even if single-processor speed improvements continue on their recent historical trend, parallel computation is still likely to be more cost-effective for many applications than using leading-edge uniprocessors. This is largely because of the costs of designing and fabricating each new generation of uniprocessors, which are unlikely to drop much until the newer technologies, such as optical computation, mature. Because the release of each new faster uniprocessor drives down the price of previous generations, putting together an ensemble of older processors provides cost-effective computation if the cost of the hardware required to connect them is kept within reasonable limits. Since each new generation of processors provides a decimal order of magnitude increase in performance, modestly sized ensembles of older processors are competitive in terms of performance. The economics of processor design and production favor replication over clever design. This effect is, in part, responsible for the popularity of networks of workstations as low-cost supercomputers.

Given these reasons for using parallelism, we might expect it to have moved rapidly into the mainstream of computing. This is clearly not the case. Indeed, in some parts of the world parallel computing is regarded as marginal. We turn now to examining some of the problems and difficulties of using parallelism that explain why its advantages have not (yet) led to its widespread use.

—Conscious human thinking appears to us to be sequential, so that there is something appealing about software that can be considered in a sequential way—a program is rather like the plot of a novel, and we have become used to designing, understanding, and debugging it in this way. This property in ourselves makes parallelism seem difficult, although of course much human cognition does take place in a parallel way.

—The theory required for parallel computation is immature and was developed after the technology. Thus the theory did not suggest directions, or even limits, for technology. As a result, we do not yet know much about abstract representations of parallel computations, logics for reasoning about them, or even parallel algorithms that are effective on real architectures.

—It is taking a long time to understand the balance necessary between the performance of different parts of a parallel computer and how this balance affects performance. Careful control of the relationship between processor speed and communication interconnect performance is necessary for good performance, and this must also be balanced with memory-hierarchy performance. Historically, parallel computers have failed to deliver more than a small fraction of their apparently achievable performance, and it has taken several generations of using a particular architecture to learn the lessons on balance.

—Parallel computer manufacturers have designated high-performance scientific and numerical computing as their market, rather than the much larger high-effectiveness commercial market. The high-performance market has always been small, and has tended to be oriented towards military applications. Recent world events have seen this market dwindle, with predictable consequences for the profitability of parallel computer manufacturers. The small market for parallel computing has meant that parallel computers are expensive, because so few of them are sold, and has increased the risk for both manufacturers and users, further dampening enthusiasm for parallelism.

—The execution time of a sequential program changes by no more than a constant factor when it is moved from one uniprocessor to another. This is not true for a parallel program, whose execution time can change by an order of magnitude when it is moved across architecture families. The fundamental nonlocal nature of a parallel program requires it to interact with a communication structure, and the cost of each communication depends heavily on how both program and interconnect are arranged and what technology is used to implement the interconnect. Portability is therefore a much more serious issue in parallel programming than in sequential. Transferring a software system from one parallel architecture to another may require an amount of work up to and including rebuilding the software completely. For fundamental reasons, there is unlikely ever to be one best architecture family, independent of technological changes. Therefore parallel software users must expect continual changes in their computing platforms, which at the moment implies continual redesign and rebuilding of software. The lack of a long-term growth path for parallel software systems is perhaps the major reason for the failure of parallel computation to become mainstream.

Approaches to parallelism have been driven either from the bottom, by the technological possibilities, or from the top, by theoretical elegance. We argue that the most progress so far, and the best hope for the future, lies in driving developments from the middle, attacking the problem at the level of the model that acts as an interface between software and hardware issues.

In the next section, we review basic concepts of parallel computing. In Section 3, we define the concept of a model and construct a checklist of properties that a model should have to provide an appropriate interface between software and architectures. In Section 4, we assess a large number of existing models using these properties, beginning with those that are most abstract and working down to those that are very concrete. We show that several models raise the possibility of both long-term portability and performance. This suggests a way to provide the missing growth path for parallel software development and hence a mainstream parallel computing industry.

2. BASIC CONCEPTS OF PARALLELISM

In this section we briefly review some of the essential concepts of parallel computers and parallel software. We begin by considering the components of parallel computers.

Parallel computers consist of three building blocks: processors, memory modules, and an interconnection network. There has been steady development of the sophistication of each of these building blocks, but it is their arrangement that most differentiates one parallel computer from another. The processors used in parallel computers are increasingly exactly the same as processors used in single-processor systems. Present technology, however, makes it possible to fit more onto a chip than just a single processor, so there is considerable investigation into what components give the greatest added value if included on-chip with a processor. Some of these, such as communication interfaces, are relevant to parallel computing.

The interconnection network connects the processors to each other and sometimes to memory modules as well. The major distinction between variants of the multiple-instruction multiple-data (MIMD) architectures is whether each processor has its own local memory, and accesses values in other memories using the network; or whether the interconnection network connects all processors to memory. These alternatives are called distributed-memory MIMD and shared-memory MIMD, respectively, and are illustrated in Figure 1.

Distributed-memory MIMD architectures can be further differentiated by the total capacity of their interconnection networks, that is, the total volume of data that can be in transit in the network at any one time. For example, an architecture whose processor-memory pairs (sometimes called processing elements) are connected by a mesh requires the same number of connections to the network for each processor, no matter how large the parallel computer of which it is a member. The total capacity of the network grows linearly with the number of processors in the computer. On the other hand, an architecture whose interconnection network is a hypercube requires the number of connections per processor to be a logarithmic function of the total size of the computer. The total network capacity grows faster than linearly in the number of processors.
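
The difference in growth rates can be made concrete with a small calculation. The following sketch is ours, not the survey's; it assumes a square two-dimensional mesh and the standard edge counts for both topologies, and simply prints how total link capacity grows with p.

    /* Sketch (not from the paper): total link counts for two standard
     * topologies, illustrating linear versus superlinear capacity growth.
     * A square 2-D mesh gives each processor at most 4 links, so the total
     * grows linearly in p; a hypercube gives each processor log2(p) links,
     * so the total grows as (p/2)*log2(p). Compile with -lm. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        for (int p = 16; p <= 4096; p *= 4) {
            int side = (int)sqrt((double)p);           /* mesh is side x side */
            long mesh_links = 2L * side * (side - 1);  /* horizontal + vertical */
            long hyper_links = (long)(p / 2.0 * log2((double)p));
            printf("p=%5d  mesh links=%6ld  hypercube links=%7ld\n",
                   p, mesh_links, hyper_links);
        }
        return 0;
    }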

Another important style of parallel computer is the single-instruction multiple-data (SIMD) class. Here a single processor executes a single instruction stream, but broadcasts each instruction to be executed to a number of data processors. These data processors interpret the instruction’s addresses either as local addresses in their own local memories or as global addresses, perhaps modified by adding a local base address to them.

We now turn to the terminology of parallel software. The code executing in a single processor of a parallel computer is in an environment that is quite similar to that of a processor running in a multiprogrammed single-processor system. Thus we speak of processes or tasks to describe code executing inside an operating-system-protected region of memory. Because many of the actions of a parallel program involve communicating with remote processors or memory locations, which takes time, most processors execute more than one process at a time. Thus all of the standard techniques of multiprogramming apply: processes become descheduled when they do something involving a remote communication, and are made ready for execution when a suitable response is received. A useful distinction is between the virtual parallelism of a program, the number of logically independent processes it contains, and the physical parallelism, the number of processes that can be active simultaneously (which is, of course, equal to the number of processors in the executing parallel computer).

Because of the number of communication actions that occur in a typical parallel program, processes are interrupted more often than in a sequential environment. Process manipulation is expensive in a multiprogrammed environment so, increasingly, parallel computers use threads rather than processes. Threads do not have their own operating-system-protected memory region. As a result there is much less context to save when a context switch occurs. Using threads is made safe by making the compiler responsible for enforcing their interaction, which is possible because all of the threads come from a single parallel program.
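
A minimal POSIX-threads sketch (ours, not the survey's; the thread count and the use of pthreads are illustrative assumptions) of the distinction just described: the program declares more virtual threads than the machine has processors, and the run-time system multiplexes them onto the physical processors.

    /* Virtual parallelism (32 threads here, chosen arbitrarily) exceeds
     * physical parallelism (the processor count reported by the OS); the
     * scheduler multiplexes the extra threads. Compile with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define VIRTUAL_THREADS 32           /* logical parallelism of the program */

    static void *worker(void *arg) {
        printf("thread %ld doing its share of the work\n", (long)arg);
        return NULL;
    }

    int main(void) {
        long physical = sysconf(_SC_NPROCESSORS_ONLN);   /* physical parallelism */
        pthread_t t[VIRTUAL_THREADS];
        printf("%d virtual threads on %ld processors\n", VIRTUAL_THREADS, physical);
        for (long i = 0; i < VIRTUAL_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < VIRTUAL_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }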

Processes communicate in a number of different ways, constrained, of course, by what is possible in the executing architecture. The three main ways are as follows.

Fig. 1. Distributed-memory MIMD and shared-memory MIMD architectures.


—Message passing. The sending process packages the message with a header indicating to which processor and process the data are to be routed, and inserts them into the interconnection network. Once the message has been passed to the network, the sending process can continue. This kind of send is called a nonblocking send. The receiving process must be aware that it is expecting data. It indicates its readiness to receive a message by executing a receive operation. If the expected data have not yet arrived, the receiving process suspends (blocks) until they do.

—Transfers through shared memory. In shared-memory architectures, processes communicate by having the sending process place values in designated locations, from which the receiving process can read them. The actual process of communication is thus straightforward. What is difficult is detecting when it is safe either to put a value into the location or to remove it. Standard operating-system techniques such as semaphores or locks may be used for this purpose. However, this is expensive and complicates programming. Some architectures provide full/empty bits associated with each word of shared memory that provide a lightweight and high-performance way of synchronizing senders and receivers.

—Direct remote-memory access. Early distributed-memory architectures required the processor to be interrupted every time a request was received from the network. This is poor use of the processor and so, increasingly, distributed-memory architectures use a pair of processors in each processing element. One, the application processor, does the program’s computation; the other, the messaging processor, handles traffic to and from the network. Taken to the limit, this makes it possible to treat message passing as direct remote memory access to the memories of other processors. This is a hybrid form of communication in that it applies to distributed-memory architectures but has many of the properties of shared memory.

These communication mechanisms need not correspond directly to what the architecture provides. It is straightforward to simulate message passing using shared memory, and possible to simulate shared memory using message passing (an approach known as virtual shared memory).
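
The first of these mechanisms can be illustrated with a short sketch. It uses MPI as one concrete realization of message passing; the choice of MPI, the ranks, and the tag are our assumptions, not something prescribed by the survey.

    /* Message passing as described above: rank 0 posts a nonblocking send
     * and may continue computing; rank 1 executes a blocking receive and
     * suspends until the data arrive. Run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;
            MPI_Request req;
            MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req); /* nonblocking */
            /* ... the sender can do other work here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* needed only before reusing the buffer */
        } else if (rank == 1) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* blocks */
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }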

The interconnection network of a parallel computer is the mechanism that allows processors to communicate with one another and with memory modules. The topology of this network is the complete arrangement of individual links between processing elements. It is naturally represented as a graph. The diameter of the interconnection network’s topology is the maximum number of links that must be traversed between any pair of processing elements. It forms a lower bound for the worst-case latency of the network, that is, the longest time required for any pair of processing elements to communicate. In general, the observed latency is greater than that implied by the topology because of congestion in the network itself.
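
Two standard diameters, given here as our own illustration rather than as figures from the text, show how strongly topology affects this lower bound: for p processing elements,

    % diameter of a square 2-D mesh versus a hypercube (illustrative)
    \[
      d_{\mathrm{mesh}} = 2\bigl(\sqrt{p} - 1\bigr), \qquad
      d_{\mathrm{hypercube}} = \log_2 p ,
    \]

so worst-case latency grows as the square root of p on the mesh but only logarithmically on the hypercube.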

The performance of a parallel program is usually expressed in terms of its execution time. This depends on the speed of the individual processors, but also on the arrangement of communication and the ability of the interconnection network to deliver it. When we speak of the cost of a program, we usually mean its execution-time cost. However, in the context of software development, cost may also include the resources necessary to develop and maintain a program.

3. MODELS AND THEIR PROPERTIES

A model of parallel computation is an interface separating high-level properties from low-level ones. More concretely, a model is an abstract machine providing certain operations to the programming level above and requiring implementations for each of these operations on all of the architectures below. It is designed to separate software-development concerns from effective parallel-execution concerns and provides both abstraction and stability. Abstraction arises because the operations that the model provides are higher-level than those of the underlying architectures, simplifying the structure of software and reducing the difficulty of its construction. Stability arises because software construction can assume a standard interface over long time frames, regardless of developments in parallel computer architecture. At the same time, the model forms a fixed starting point for the implementation effort (transformation system, compiler, and run-time system) directed at each parallel computer. The model therefore insulates those issues that are the concern of software developers from those that are the concern of implementers. Furthermore, implementation decisions, and the work they require, are made once for each target rather than once for each program.

Since a model is just an abstract machine, models exist at many different levels of abstraction. For example, every programming language is a model in our sense, since each provides some simplified view of the underlying hardware. This makes it hard to compare models neatly because of the range of levels of abstraction involved and because many high-level models can be emulated by other lower-level models. There is not even a one-to-one connection between models: a low-level model can naturally emulate several different higher-level ones, and a high-level model can be naturally emulated by different low-level ones. We do not explicitly distinguish between programming languages and more abstract models (such as asynchronous order-preserving message passing) in what follows.

An executing parallel program is an extremely complex object. Consider a program running on a 100-processor system, large but not unusual today. There are 100 active threads at any given moment. To conceal the latency of communication and memory access, each processor is probably multiplexing several threads, so the number of active threads is several times larger (say, 300). Any thread may communicate with any of the other threads, and this communication may be asynchronous or may require a synchronization with the destination thread. So there are up to 300² possible interactions “in progress” at any instant. The state of such a program is very large. The program that gives rise to this executing entity must be significantly more abstract than a description of the entity itself if it is to be manageable by humans. To put it another way, a great deal of the actual arrangement of the executing computation ought to be implicit and capable of being inferred from its static description (the program), rather than having to be stated explicitly. This implies that models for parallel computation require high levels of abstraction, much higher than for sequential programming. It is still (just) conceivable to construct modestly sized sequential programs in assembly code, although the newest sequential architectures make this increasingly difficult. It is probably impossible to write a modestly sized MIMD parallel program for 100 processors in assembly code in a cost-effective way.

Furthermore, the detailed execution behavior of a particular program on an architecture of one style is likely to be very different from the detailed execution on another. Thus abstractions that conceal the differences between architecture families are necessary.

On the other hand, a model that is abstract is not of great practical interest if an efficient method for executing programs written in it cannot be found. Thus models must not be so abstract that it is intellectually, or even computationally, expensive to find a way to execute them with reasonable efficiency on a large number of parallel architectures. A model, to be useful, must address both issues, abstraction and effectiveness, as summarized in the following set of requirements [Skillicorn 1994b]. A good model of parallel computation should have the following properties.

(1) Easy to Program. Because an executing program is such a complex object, a model must hide most of the details from programmers if they are to be able to manage, intellectually, the creation of software. As much as possible of the exact structure of the executing program should be inserted by the translation mechanism (compiler and run-time system) rather than by the programmer. This implies that a model should conceal the following.

—Decomposition of a program into parallel threads. The program must be divided up into the pieces that will execute on distinct processors. This requires separating the code and data structures into a potentially large number of pieces.

—Mapping of threads to processors. Once the program has been divided into pieces, a choice must be made about which piece is placed on which processor. The placement decision is often influenced by the amount of communication taking place between each pair of pieces, ensuring that pieces that communicate a lot are placed near each other in the interconnection network. It may also be necessary to ensure that particular pieces are mapped to processors that have some special hardware capability, for example, a high-performance floating-point functional unit.

—Communication among threads. Whenever nonlocal data are required, a communication action of some kind must be generated to move the data. Its exact form depends heavily on the designated architecture, but the processes at both ends must arrange to treat it consistently so that one process does not wait for data that never come.

—Synchronization among threads. There will be times during the computation when a pair of threads, or even a larger group, must know that they have jointly reached a common state. Again the exact mechanism used is target-dependent. There is enormous potential for deadlock in the interaction between communication and synchronization.

Optimal decomposition and mapping are known to be exponentially expensive to compute. The nature of communication also introduces complexity. Communication actions always involve dependencies, because the receiving end must be prepared to block and the sender often does too. This means that a chain of otherwise unexceptional communications among n processes can be converted to a deadlocked cycle by the actions of a single additional process (containing, say, a receiving action from the last process in the chain before a sending action to the first). Hence correctly placing communication actions requires awareness of an arbitrarily large fraction of the global state of the computation. Experience has shown that deadlock in parallel programs is indeed easy to create and difficult to remove. Requiring humans to understand programs at this level of detail effectively rules out scalable parallel programming.
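
The following sketch (ours, using MPI; the survey gives no code) shows how easily such a cycle arises: every process sends to its right-hand neighbor on a ring before receiving from its left-hand neighbor, and because the sends block, all processes wait on one another forever.

    /* Deliberately deadlocking example: MPI_Ssend blocks until the matching
     * receive starts, so every process blocks in the send and no process
     * ever reaches its receive. Reordering send and receive in a single
     * process, or using a nonblocking send, breaks the cycle. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size, out = 1, in;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;
        int left  = (rank + size - 1) % size;

        MPI_Ssend(&out, 1, MPI_INT, right, 0, MPI_COMM_WORLD);  /* all block here */
        MPI_Recv(&in, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }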

Thus models ought to be as abstract and simple as possible. There should be as close a fit as possible between the natural way in which to express the program and that demanded by the programming language. For many programs, this may mean that parallelism is not even made explicit in the program text. For applications that are naturally expressed in a concurrent way, it implies that the apparent parallel structure of the program need not be related to the actual way in which parallelism is exploited at execution.

(2) Software Development Methodology. The previous requirement implies a large gap between the information provided by the programmer about the semantic structure of the program, and the detailed structure required to execute it. Bridging it requires a firm semantic foundation on which transformation techniques can be built. Ad hoc compilation techniques cannot be expected to work on problems of this complexity.

There is another large gap between specifications and programs that must also be addressed by firm semantic foundations. Existing sequential software is, with few exceptions, built using standard building blocks and algorithms. The correctness of such programs is almost never properly established; rather they are subjected to various test regimes, designed to increase confidence in the absence of disastrous failure modes. This methodology of testing and debugging does not extend to portable parallel programming for two major reasons: the new degree of freedom created by partitioning and mapping hugely increases the state space that must be tested—debugging thus requires interacting with this state space, in which even simple checkpoints are difficult to construct; and the programmer is unlikely to have access to more than a few of the designated architectures on which the program will eventually execute, and therefore cannot even begin to test the software on other architectures. There are also a number of more mundane reasons: for example, reproducibility on systems whose communication networks are effectively nondeterministic, and the sheer volume of data that testing generates. Verification of program properties after construction also seems too unwieldy for practical use. Thus only a process aiming to build software that is correct by construction can work in the long term. Such calculational approaches have been advocated for sequential programming, and they are not yet workable in the large there. Nevertheless, they seem essential for parallel programming, even if they must remain goals rather than practice in the medium term.

A great deal of the parallel software produced so far has been of a numerical or scientific kind, and development methodology issues have not been a priority. There are several reasons for this. Much numerical computing is essentially linear algebra, and so program structures are often both regular and naturally related to the mathematics from which they derive. Thus such software is relatively simpler than newer commercial applications. Second, long-term maintainability is emphasized less because of the research nature of such software. Many programs are intended for short-term use, generating results that make themselves obsolete by suggesting new problems to be attacked and new techniques to be used. By contrast, most commercial software must repay its development costs out of the business edge it creates.

(3) Architecture-Independent. The model should be architecture-independent, so that programs can be migrated from parallel computer to parallel computer without having to be redeveloped or indeed modified in any nontrivial way. This requirement is essential to permit a widespread software industry for parallel computers.

Computer architectures have comparatively short life spans because of the speed with which processor and interconnection technology are developing. Users of parallel computing must be prepared to see their computers replaced, perhaps every five years. Furthermore, it is unlikely that each new parallel computer will much resemble the one that it replaces. Redeveloping software more or less from scratch whenever this happens is not cost-effective, although this is usually what happens today. If parallel computation is to be useful, it must be possible to insulate software from changes in the underlying parallel computer, even when these changes are substantial.

This requirement means that a model must abstract from the features of any particular style of parallel computer. Such a requirement is easy to satisfy in isolation, since any sufficiently abstract model satisfies it, but is more difficult with the other requirements.

(4) Easy to Understand. A model should be easy to understand and to teach, since otherwise it is impossible to educate existing software developers to use it. If parallelism is to become a mainstream part of computing, large numbers of people have to become proficient in its use. If parallel programming models are able to hide the complexities and offer an easy interface they have a greater chance of being accepted and used. Generally, easy-to-use tools with clear goals, even if minimal, are preferable to complex ones that are difficult to use.

These properties ensure that a model forms an effective target for software development. However, this is not useful unless, at the same time, the model can be implemented effectively on a range of parallel architectures. Thus we have some further requirements.

(5) Guaranteed Performance. A model should have guaranteed performance over a useful variety of parallel architectures. Note that this does not mean that implementations must extract every last ounce of performance out of a designated architecture. For most problems, a level of performance as high as possible on a given architecture is unnecessary, especially if obtained at the expense of much higher development and maintenance costs. Implementations should aim to preserve the order of the apparent software complexity and keep constants small.

Fundamental constraints on architectures, based on their communication properties, are now well understood. Architectures can be categorized by their power in the following sense: an architecture is powerful if it can execute an arbitrary computation with the expected performance, given the text of the program and basic properties of the designated architecture. The reason for reduced performance, in general, is congestion, which arises not from individual actions of the program but from the collective behavior of these actions. Powerful architectures have the resources, in principle at least, to handle congestion without having an impact on program performance; less powerful ones do not.

The most powerful architecture class consists of shared-memory MIMD computers and distributed-memory MIMD computers whose total interconnection network capacity grows faster than the number of processors, at least as fast as p log p, where p is the number of processors. For such computers, an arbitrary computation with parallelism p taking time t can be executed in such a way that the product pt (called the work) is preserved [Valiant 1990b]. The apparent time of the abstract computation cannot be preserved in a real implementation since communication (and memory access) imposes latencies, typically proportional to the diameter of the interconnection network. However, the time dilation that this causes can be compensated for by using fewer processors, multiplexing several threads of the original program onto each one, and thus preserving the product of time and processors. There is a cost to this implementation, but it is an indirect one—there must be more parallelism in the program than in the designated architecture, a property known as parallel slackness.
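
A sketch of the argument, in our notation rather than Valiant's: an abstract computation with parallelism p and time t is run on a machine with q < p processors by multiplexing p/q threads onto each processor, so that

    % work preservation under multiplexing (our rendering of the claim)
    \[
      t_{\mathrm{real}} \approx \frac{p}{q}\, t , \qquad
      q \cdot t_{\mathrm{real}} \approx p\, t ,
    \]

and the work pt is preserved. Hiding a latency of l steps per communication requires roughly p/q >= l runnable threads per processor, which is the parallel slackness condition.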

Architectures in this richly connected class are powerful, but do not scale well because an increasing proportion of their resources must be devoted to interconnection network hardware. Worse still, the interconnection network is typically the most customized part of the architecture, and therefore by far the most expensive part.

The second class of architectures are distributed-memory MIMD computers whose total interconnection network capacity grows only linearly with the number of processors. Such computers are scalable because they require only a constant number of communication links per processor (and hence the local neighborhoods of processors are unaffected by scaling) and because a constant proportion of their resources is devoted to interconnection network hardware. Implementing arbitrary computations on such machines cannot be achieved without loss of performance proportional to the diameter of the interconnection network. Computations taking time t on p processors have an actual work cost of ptd (where d is the diameter of the interconnection network). What goes wrong in emulating arbitrary computations on such architectures is that, during any step, each of the p processors could generate a communication action. Since there is only total capacity proportional to p in the interconnection network, these communications use its entire capacity for the next d steps in the worst case. Communication actions attempted within this window of d steps can be avoided only if the entire program is slowed by a factor of d to compensate. Thus this class necessarily introduces reduced performance when executing computations that communicate heavily. Architectures in this class are scalable, but they are not as powerful as those in the previous class.
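
In the same notation (our summary, not a formula quoted from the text), the loss for this class can be written as

    % work inflation on capacity-limited networks
    \[
      \text{work} \approx p\, t\, d ,
    \]

and since a two-dimensional mesh has d proportional to the square root of p, heavily communicating computations pay roughly a square-root-of-p slowdown relative to the richly connected class.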

The third class of architectures are SIMD machines which, though scalable, emulate arbitrary computations poorly. This is because of their inability to do more than a small constant number of different actions on each step [Skillicorn 1990].

Thus scalable architectures are not powerful and powerful architectures are not scalable. To guarantee performance across many architectures, these results imply that

—the amount of communication allowed in programs must be reduced by a factor proportional to the diameter of realistic parallel computers (i.e., by a factor of log p or √p); and

—computations must be made more regular, so that processors do fewer different operations at each moment, if SIMD architectures are considered as viable designated architectures.

The amount of communication that a program carries out can be reduced in two ways: either by reducing the number of simultaneous communication actions, or by reducing the distance that each travels. It is attractive to think that distance could always be reduced by clever mapping of threads to processors, but this does not work for arbitrary programs. Even heuristic algorithms whose goal is placement for maximum locality are expensive to execute and cannot guarantee good results. Only models that limit the frequency of communication, or are restricted enough to make local placement easy to compute, can guarantee performance across the full range of designated parallel computers.

(6) Cost Measures. Any program’s design is driven, more or less explicitly, by performance concerns. Execution time is the most important of these, but others such as processor utilization or even cost of development are also important. We describe these collectively as the cost of the program.

The interaction of cost measures with the design process in sequential software construction is a relatively simple one. Because any sequential machine executes with speed proportional to any other, design decisions that change the asymptotic complexity of a program can be made before any consideration of the computer on which it will eventually run. When a target decision has been made, further changes can be made, but they are of the nature of tuning, rather than algorithm choice. In other words, the construction process has two phases: decisions are made about algorithms and the asymptotic costs of the program are affected; and decisions are made about arrangements of program text and only the constants in front of the asymptotic costs are affected [Knuth 1976].

This neat division cannot be made for parallel software development, because small changes in program text and choice of designated computer are both capable of affecting the asymptotic cost of a program. If real design decisions are to be made, a model must make the cost of its operations visible during all stages of software development, before either the exact arrangement of the program or the designated computer has been decided. Intelligent design decisions rely on the ability to decide that Algorithm A is better than Algorithm B for a particular problem.

This is a difficult requirement for a model, since it seems to violate the notion of an abstraction. We cannot hope to determine the cost of a program without some information about the computer on which it will execute, but we must insist that the information required be minimal (since otherwise the actual computation of the cost is too tedious for practical use). We say that a model has cost measures if it is possible to determine the cost of a program from its text, minimal designated computer properties (at least the number of processors it has), and information about the size, but not the values, of its input. This is essentially the same view of cost that is used in theoretical models of parallel complexity such as the PRAM [Karp and Ramachandran 1990]. This requirement is the most contentious of all of them. It requires that models provide predictable costs and that compilers do not optimize programs. This is not the way in which most parallel software is regarded today, but we reiterate that design is not possible without it. And without the ability to design, parallel software construction will remain a black art rather than an engineering discipline.
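
A standard example, ours rather than the survey's, of a cost measure usable at design time: summing n values on p processors by local sums followed by a tree combine has predicted cost

    % illustrative design-time cost formula for a parallel reduction
    \[
      T(n, p) \approx c_1 \frac{n}{p} + c_2 \log p ,
    \]

which depends only on the program, the input size n, and the processor count p, exactly the information the requirement allows; the constants c1 and c2 are machine properties supplied once per target.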

A further requirement on cost measures is that they be well behaved with respect to modularity. Modern software is almost always developed in pieces by separate teams and it is important that each team need only know details of the interface between pieces. This means that it must be possible to give each team a resource budget such that the overall cost goal is met if each team meets its individual cost allocation. This implies that the cost measures must be compositional, so that the cost of the whole is easily computable from the cost of its parts, and convex, so that it is not possible to reduce the overall cost by increasing the cost of one part. Naive parallel cost measures fail to meet either of these requirements.
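
One way to state these two properties, as our formalization rather than the survey's, is that for a program assembled from parts P1 and P2,

    % compositionality and convexity of cost measures (our formalization)
    \[
      \mathrm{cost}(P_1 ; P_2) = f\bigl(\mathrm{cost}(P_1), \mathrm{cost}(P_2)\bigr) ,
    \]

with f monotone nondecreasing in each argument, so that increasing the cost of one part can never decrease the cost of the whole.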

3.1 Implications

These requirements for a model are quite demanding and several subsets of them are strongly in tension with each other. For example, abstract models make it easy to build programs but hard to compile them to efficient code, whereas low-level models make it hard to build software but easy to implement it efficiently. We use these requirements as a metric by which to classify and assess models.

The level of abstraction that models provide is used as the primary basis for categorizing them. It acts as a surrogate for simplicity of the model, since in an abstract model less needs to be said about details of thread structure and points at which communication and synchronization take place. Level of abstraction also correlates well with quality of software development methodology since calculations on programs are most powerful when their semantics is clean.

The extent to which the structure of program implementations is constrained by the structure of the program text is closely related to properties such as guaranteed performance and cost measures. There are three levels of constraint on program structure.

—Models that allow dynamic process or thread structure cannot restrict communication. Because any new thread can generate communication, even models that restrict communication within a particular syntactic block cannot limit it over the whole program. Thus such models cannot guarantee that the communication generated by a program will not overrun the total communication capacity of a designated parallel computer. Because dynamic process creation involves decisions by the run-time system, it is also impossible to define cost measures that can be used during design. (This does not mean, of course, that it is impossible to write efficient programs in models with dynamic process structure, or that it is impossible to improve efficiency by good design or clever compilation techniques. All it means is that some programs that can be written in the model will perform poorly, and it is not straightforward to detect which ones. Such programs could potentially be avoided by a programming discipline, but they cannot be ruled out on the model’s own terms.)

—Models that have a static process or thread structure but do not impose syntactic limits on communication permit programs to be written that can perform badly because of communication overruns. On the other hand, it is usually possible to define cost measures for them, since the overruns can be statically predicted from the program structure.

—Models that have a static process or thread structure can guarantee performance by suitably limiting the frequency with which communication actions can be written in the program. It follows that it is also possible to define cost measures for them.

In summary, only when the structure of the program is static, and the amount of communication is bounded, does a model both guarantee performance and allow that performance to be predicted by cost measures.

We use control of structure and communication as the secondary basis for categorizing models. This choice of priorities for classification reflects our view that parallel programming should aim to become a mainstream part of computing. In specialized areas, some of these requirements may be less important. For example, in the domain of high-performance numerical computing, program execution times are often quadratic or even worse in the size of the problem. In this setting, the performance penalty introduced by execution on a distributed-memory MIMD computer with a mesh topology, say, may be insignificant compared to the flexibility of an unrestricted-communication model. There will probably never be a model that satisfies all potential users of parallelism. However, models that satisfy many of the preceding requirements are good candidates for general-purpose parallelism, the application of parallelism across wide problem domains [McColl 1993a].

4. OVERVIEW OF MODELS

We now turn to assessing existing models according to the criteria outlined in the previous section. Most of these models were not developed with the ambitious goal of general-purpose parallelism, so it is not a criticism to say that some of them fail to meet all of the requirements. Our goal is to provide a picture of the state of parallel programming today, but from the perspective of seeing how far towards general-purpose parallelism it is reasonable to get.

We have not covered all models for parallel computation, but we have tried to include those that introduce significant ideas, together with some sense of the history of such models. We do not give a complete description of each model, but instead concentrate on the important features and provide comprehensive references. Many of the most important papers on programming models and languages have been reprinted in Skillicorn and Talia [1994].

Models are presented in decreasing order of abstraction, in the following categories.

(1) Models that abstract from parallelism completely. Such models describe only the purpose of a program and not how it is to achieve this purpose. Software developers do not even need to know if the program they build will execute in parallel. Such models are necessarily abstract and relatively simple, since programs need be no more complex than sequential ones.

(2) Models in which parallelism is made explicit, but decomposition of programs into threads is implicit (and hence so is mapping, communication, and synchronization). In such models, software developers are aware that parallelism will be used and must have expressed the potential for it in programs, but they do not know how much parallelism will actually be applied at run-time. Such models often require programs to express the maximal parallelism present in the algorithm and then the implementation reduces that degree of parallelism to fit the designated architecture, at the same time working out the implications for mapping, communication, and synchronization.

(3) Models in which parallelism and decomposition must both be made explicit, but mapping, communication, and synchronization are implicit. Such models require decisions to be made about the breaking up of available work into pieces, but they relieve the software developer of the implications of such decisions.

(4) Models in which parallelism, decomposition, and mapping are explicit, but communication and synchronization are implicit. Here the software developer must not only break the work up into pieces, but must also consider how best to place the pieces on the designated processor. Since locality often has a marked effect on communication performance, this almost inevitably requires an awareness of the designated processor’s interconnection network. It becomes hard to make such software portable across different architectures.

(5) Models in which parallelism, decomposition, mapping, and communication are explicit, but synchronization is implicit. Here the software developer is making almost all of the implementation decisions, except that fine-scale timing decisions are avoided by having the system deal with synchronization.

(6) Models in which everything is explicit. Here software developers must specify all of the detail of the implementation. As we noted earlier, it is extremely difficult to build software using such models, because both correctness and performance can only be achieved by attention to vast numbers of details.

Within each of these categories, we present models according to their degree of control over structure and communication, in these categories:

—models in which thread structure is dynamic;

—models in which thread structure is static but communication is not limited; and

—models in which thread structure is static and communication is limited.

Within each of these categories we present models based on their common paradigms. Figure 2 shows a classification of models for parallel computation in this way.

4.1 Nothing Explicit

The best models of parallel computation for programmers are those in which they need not be aware of parallelism at all. Hiding all the activities required to execute a parallel computation means that software developers can carry over their existing skills and techniques for sequential software development. Of course, such models are necessarily abstract, which makes the implementer's job difficult, since the transformation, compilation, and run-time systems must infer all of the structure of the eventual program. This means deciding how the specified computation is to be achieved, dividing it into appropriately sized pieces for execution, mapping those pieces, and scheduling all of the communication and synchronization among them.

At one time it was widely believed that automatic translation from abstract program to implementation might be effective starting from an ordinary sequential imperative language. Although a great deal of work was invested in parallelizing compilers, the approach was defeated by the complexity of determining whether some aspect of a program was essential or simply an artifact of its sequential expression. It is now acknowledged that a highly automated translation process is only practical if it begins from a carefully chosen model that is both abstract and expressive.

Inferring all of the details required for an efficient and architecture-independent implementation is possible, but it is difficult and, at present, few such models can guarantee performance.

We consider models at this high level of abstraction in subcategories: those that permit dynamic process or thread structure and communication, those that have static process or thread structure but unlimited communication, and those that also limit the amount of communication in progress at any given moment.

Nothing Explicit, Parallelism Implicit
  Dynamic Structure
    Higher-order Functional: Haskell
    Concurrent Rewriting: OBJ, Maude
    Interleaving: Unity
    Implicit Logic Languages: PPP, AND/OR, REDUCE/OR, Opera, Palm, concurrent constraint languages
  Static Structure
    Algorithmic Skeletons: P3L, Cole, Darlington
  Static and Communication-Limited Structure
    Homomorphic Skeletons: Bird-Meertens Formalism
    Cellular Processing Languages: Cellang, Carpet, CDL, Ceprol
    Crystal

Parallelism Explicit, Decomposition Implicit
  Dynamic Structure
    Dataflow: Sisal, Id
    Explicit Logic Languages: Concurrent Prolog, PARLOG, GHC, Delta-Prolog, Strand
    Multilisp
  Static Structure
    Data Parallelism Using Loops: Fortran variants, Modula 3*
    Data Parallelism on Types: pSETL, parallel sets, match and move, Gamma, PEI, APL, MOA, Nial and AT
  Static and Communication-Limited Structure
    Data-Specific Skeletons: scan, multiprefix, paralations, Dataparallel C, NESL, CamlFlight

Decomposition Explicit, Mapping Implicit
  Dynamic Structure
  Static Structure
    BSP, LogP
  Static and Communication-Limited Structure

Mapping Explicit, Communication Implicit
  Dynamic Structure
    Coordination Languages: Linda, SDL
    Non-message Communication Languages: ALMS, PCN, Compositional C++
    Virtual Shared Memory
    Annotated Functional Languages: Paralf
    RPC: DP, Cedar, Concurrent CLU
  Static Structure
    Graphical Languages: Enterprise, Parsec, Code
    Contextual Coordination Languages: Ease, ISETL-Linda, Opus
  Static and Communication-Limited Structure
    Communication Skeletons

Communication Explicit, Synchronization Implicit
  Dynamic Structure
    Process Networks: Actors, Concurrent Aggregates, ActorSpace, Darwin
    External OO: ABCL/1, ABCL/R, POOL-T, EPL, Emerald, Concurrent Smalltalk
    Objects and Processes: Argus, Presto, Nexus
    Active Messages: Movie
  Static Structure
    Process Networks: static dataflow
    Internal OO: Mentat
  Static and Communication-Limited Structure
    Systolic Arrays: Alpha

Everything Explicit
  Dynamic Structure
    Message Passing: PVM, MPI
    Shared Memory: FORK, Java, thread packages
    Rendezvous: Ada, SR, Concurrent C
  Static Structure
    Occam
    PRAM

Figure 2. Classification of models of parallel computation.

4.1.1 Dynamic Structure. A popular approach to describing computations in a declarative way, in which the desired result is specified without saying how that result is to be computed, is to use a set of functions and equations on them. The result of the computation is a solution, usually a least fixed point, of these equations. This is an attractive framework in which to develop software, for such programs are both abstract and amenable to formal reasoning by equational substitution. The implementation problem is then to find a mechanism to solve such equations.

Higher-order functional programming treats functions as λ-terms and computes their values using reduction in the λ-calculus, allowing them to be stored in data structures, passed as arguments, and returned as results. An example of a language that allows higher-order functions is Haskell [Hudak and Fasel 1992]. Haskell also includes several typical features of functional programming such as user-defined types, lazy evaluation, pattern matching, and list comprehensions. Furthermore, Haskell has a parallel functional I/O system and provides a module facility.

The actual technique used in higher-order functional languages for computing function values is called graph reduction [Peyton-Jones and Lester 1992]. Functions are expressed as trees, with common subtrees for shared subfunctions (hence graphs). Computation rules select graph substructures, reduce them to simpler forms, and replace them in the larger graph structure. When no further computation rules can be applied, the graph that remains is the result of the computation.

It is easy to see how the graph-reduction approach can be parallelized in principle: rules can be applied to non-overlapping sections of the graph independently and hence simultaneously. Thus multiple processors can search for reducible parts of the graph independently and in a way that depends only on the structure of the graph (and so does not have to be inferred by a compiler beforehand). For example, if the expression (exp1 * exp2), where exp1 and exp2 are arbitrary expressions, is to be evaluated, two threads may independently evaluate exp1 and exp2, so that their values are computed simultaneously.
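This style of parallelism can be made concrete in a few lines. The following sketch uses the par and pseq combinators of GHC's Control.Parallel module from the parallel package; it is only an illustration in Haskell, not part of any of the graph-reduction systems discussed here.

   import Control.Parallel (par, pseq)

   -- Evaluate the two subexpressions of (exp1 * exp2) in parallel:
   -- par sparks its first argument for possible evaluation on another
   -- processor, and pseq forces e2 to be evaluated before the multiplication.
   parTimes :: Int -> Int -> Int
   parTimes e1 e2 = e1 `par` (e2 `pseq` (e1 * e2))

Because Haskell passes arguments as unevaluated thunks, e1 and e2 stand for arbitrary expressions whose evaluations may proceed in parallel.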

Unfortunately, this simple idea turns out to be quite difficult to make work effectively. First, only computations that contribute to the final result should be executed, since doing others wastes resources and alters the semantics of the program if a nonessential piece fails to terminate. For example, most functional languages have some form of conditional like

   if b(x) then f(x)
           else g(x)

Clearly exactly one of the values of f(x) or g(x) is needed, but which one is not known until the value of b(x) is known. So evaluating b(x) first prevents redundant work, but on the other hand lengthens the critical path of the computation (compared to evaluating f(x) and g(x) speculatively). Things are even worse if, say, f(x) fails to terminate, but only for values of x for which b(x) is false. Now evaluating f(x) speculatively causes the program not to terminate, whereas the other evaluation order does not.

It is difficult to find independent program pieces that are known to be required to compute the final result without quite sophisticated analysis of the program as a whole. Also, the actual structure of the graph changes dramatically during evaluation, so that it is difficult to do load-balancing well and to handle the spawning of new subtasks and communication effectively. Parallel graph reduction has been a limited success for shared-memory computers, but its effectiveness for distributed-memory computers is still unknown.1 Such models are simple and abstract and allow software development by transformation, but they do not guarantee performance. Much of what happens during execution is determined dynamically by the run-time system, so cost measures (in our sense) cannot practically be provided.

Concurrent rewriting is a closely related approach in which the rules for rewriting parts of programs are chosen in some other way. Once again, programs are terms describing a desired result. They are rewritten by applying a set of rules to subterms repeatedly until no further rules can be applied. The resulting term is the result of the computation. The rule set is usually chosen to be both terminating (there is no infinite sequence of rewrites) and confluent (applying rules to overlapping subterms gets the same result in the end), so that the order and position where rules are applied make no difference to the final result. Some examples of such models are OBJ [Goguen and Winkler 1988; Goguen et al. 1993, 1994], a functional language whose semantics is based on equational logic, and Maude.2 An example, based on one in Lincoln et al. [1994], gives the flavor of this approach. The following is a functional module for polynomial differentiation, assuming the existence of a module that represents polynomials and the usual actions on them. The lines beginning with eq are rewrite rules, which should be familiar from elementary calculus. The line beginning with ceq is a conditional rewrite rule.

   fmod POLY-DER is
     protecting POLYNOMIAL .
     op der : Var Poly -> Poly .
     op der : Var Mon -> Poly .
     var A : Int .
     var N : NzNat .
     vars P Q : Poly .
     vars U V : Mon .
     eq der(P + Q) = der(P) + der(Q) .
     eq der(U . V) = (der(U) . V) + (U . der(V)) .
     eq der(A * U) = A * der(U) .
     ceq der(X ^ N) = N * (X ^ (N - 1)) if N > 1 .
     eq der(X ^ 1) = 1 .
     eq der(A) = 0 .
   endfm

An expression such as

   der(X ^ 5 + 3 * X ^ 4 + 7 * X ^ 2)

can be computed in parallel because there are soon multiple places where a rewrite rule can be applied. This simple idea can be used to emulate many other parallel computation models.
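To make the rewriting reading concrete, the same rules can be written over an explicit term type; the following Haskell sketch is only an illustration of the rule set above, not OBJ or Maude code.

   -- Terms of the polynomial language and the POLY-DER rewrite rules.
   data Poly
     = Con Int            -- an integer constant A
     | Pow Int            -- X ^ N
     | Scale Int Poly     -- A * U
     | Times Poly Poly    -- U . V
     | Plus Poly Poly     -- P + Q

   der :: Poly -> Poly
   der (Plus p q)  = Plus (der p) (der q)                      -- der(P + Q)
   der (Times u v) = Plus (Times (der u) v) (Times u (der v))  -- der(U . V)
   der (Scale a u) = Scale a (der u)                           -- der(A * U)
   der (Pow n)
     | n > 1       = Scale n (Pow (n - 1))                     -- der(X ^ N), N > 1
     | otherwise   = Con 1                                     -- der(X ^ 1)
   der (Con _)     = Con 0                                     -- der(A)

Applied to the example expression, der immediately produces three independent subterms, each of which a concurrent rewriting engine could reduce simultaneously.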

Models of this kind are simple and abstract and allow software development by transformation, but again they cannot guarantee performance and are too dynamic to allow useful cost measures.

Interleaving is a third approach that derives from multiprogramming ideas in operating systems via models of concurrency such as transition systems. If a computation can be expressed as a set of subcomputations that commute, that is, that can be evaluated in any order and repeatedly, then there is considerable freedom for the implementing system to decide on the actual structure of the executing computation. It can be quite hard to express a computation in this form, but it is made considerably easier by allowing each piece of the computation to be protected by a guard, that is, a Boolean-valued expression. Informally speaking, the semantics of a program in this form is that all the guards are evaluated and one or more subprograms whose guards are true are then evaluated. When they have completed, the whole process begins again. Guards could determine the whole sequence of the computation, even sequentializing it by having guards of the form step = i, but the intent of the model is rather to use the weakest guards, and therefore say the least about how the pieces are to be fit together.

1 See Darlington et al. [1987], Hudak and Fasel [1992], Kelly [1989], Peyton-Jones et al. [1987], Rabhi and Manson [1991], and Thompson [1992].
2 See Lechner et al. [1995], Meseguer [1992], Meseguer and Winkler [1991], and Winkler [1993].

This idea lies behind UNITY3 and an alternative that considers independence of statements more: action systems.4

UNITY (unbounded nondeterministic iterative transformations) is both a computational model and a proof system. A UNITY program consists of declarations of variables, specification of their initial values, and a set of multiple-assignment statements. In each step of execution, some assignment statement is selected nondeterministically and executed. For example, the following program,

   Program P
   initially x = 0
   assign x := a(x) [] x := b(x) [] x := c(x)
   end {P}

consists of three assignments that are selected nondeterministically and executed. The selection procedure obeys a fairness rule: every assignment is executed infinitely often.
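The execution model itself is easy to sketch. In the following Haskell fragment (an illustration only, not part of UNITY), each assignment is modeled as a state-transforming function, and fairness is approximated by a round-robin schedule truncated to n steps.

   -- Apply the assignments round-robin; cycling through them is one fair
   -- schedule among the many allowed by the nondeterministic semantics.
   unityRun :: Int -> [s -> s] -> s -> s
   unityRun n assigns s0 = foldl (\s f -> f s) s0 (take n (cycle assigns))

   -- For example, unityRun 4 [(+ 1), (* 2)] 0 == 6.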

Like rewriting approaches, interleaving models are abstract and simple, but guaranteed performance seems unlikely and cost measures are not possible.

Implicit logic languages exploit the fact that the resolution process of a logic query contains many activities that can be performed in parallel [Chassin de Kergommeaux and Codognet 1994]. The main types of inherent parallelism in logic programs are OR parallelism and AND parallelism. OR parallelism is exploited by unifying a subgoal with the head of several clauses in parallel. For instance, if the subgoal to be solved is ?- a(x) and the matching clauses are

a(x):- b(x). a(x):- c(x).

then OR parallelism is exploited by unifying, in parallel, the subgoal with the head of each of the two clauses. For an introduction to logic programming, see Lloyd [1984]. AND parallelism divides the computation of a goal into several threads, each of which solves a single subgoal in parallel. For instance, if the goal to be solved is

?- a(x), b(x), c(x).

then the subgoals a(x), b(x), and c(x) are solved in parallel. Minor forms of parallelism are search parallelism and unification parallelism, where parallelism is exploited, respectively, in the searching of the clause database and in the unification procedure.

Implicit parallel logic languages provide automatic decomposition of the execution tree of a logic program into a network of parallel threads. This is done by the language support, both by static analysis at compile time and at run-time. No explicit annotations of the program are needed. Implicit logic models include PPP [Fagin and Despain 1990], the AND/OR process model [Conery 1987], the REDUCE/OR model [Kale 1987], OPERA [Briat et al. 1991], and PALM [Cannataro et al. 1991]. These models differ in how they view parallelism and their designated architectures are varied, but they are mainly designed to be implemented on distributed-memory MIMD machines [Talia 1994]. To implement parallelism these models use either thread-based or subtree-based strategies. In thread-based models, each single goal is solved by starting a thread. In subtree-based models, the search tree is divided into several subtrees, with one thread associated with each subtree. These two approaches correspond to different grain sizes: in thread-based models the grain size is fine, whereas in subtree-based models the parallelism grain size is medium or coarse.

Like other approaches discussed in this section, implicit parallel logic languages are highly abstract. Thus it is hard to guarantee performance, although good performance may sometimes be achieved. Cost measures cannot be provided because implicit logic languages are highly dynamic.

3 See Bickford [1994], Chandy and Misra [1988], Gerth and Pneuli [1989], and Radha and Muthukrishnan [1992].
4 See Back [1989a,b] and Back and Sere [1989; 1990].

Constraint logic programming is an important generalization of logic programming aimed at replacing the pattern-matching mechanism of unification by a more general operation called constraint satisfaction [Saraswat et al. 1991]. In this setting, a constraint is a subset of the space of all possible values that a variable of interest can take. A programmer does not explicitly use parallel constructs in a program, but defines a set of constraints on variables. This approach offers a framework for dealing with domains other than Herbrand terms, such as integers and Booleans [Lloyd 1984]. In concurrent constraint logic programming, a computation progresses by executing threads that communicate by placing constraints in a global store and synchronize by checking whether a constraint is entailed by the store. The communication patterns are dynamic, so that there is no predetermined set of threads with which a given thread interacts. Moreover, threads correspond to goal atoms, so they are activated dynamically during program execution. Concurrent constraint logic programming models include cc [Saraswat et al. 1991], the CHIP CLP language [van Hentenryck 1989], and CLP [Jaffar and Lassez 1987]. As in other parallel logic models, concurrent constraint languages are too dynamic to allow practical cost measures.

4.1.2 Static Structure. One way to infer the structure to be used to compute an abstract program is to insist that the abstract program be based on fundamental units or components whose implementations are predefined. In other words, programs are built by connecting ready-made building blocks. This approach has the following natural advantages.

—The building blocks raise the level of abstraction because they are the fundamental units in which programmers work. They can hide an arbitrary amount of internal complexity.

—The building blocks can be internally parallel but composable sequentially, in which case programmers need not be aware that they are programming in parallel.

—The implementation of each building block needs to be done only once for each architecture. The implementation can be done by specialists, and time and energy can be devoted to making it efficient.

In the context of parallel programming, such building blocks have come to be called skeletons [Cole 1989], and they underlie a number of important models. For example, a common parallel programming operation is to sum the elements of a list. The arrangement of control and communication to do this is exactly the same as that for computing the maximum element of a list, and for several other similar operations. Observing that these are all special cases of a reduction provides a new abstraction for programmer and implementer alike. Furthermore, computing the maximum element of an array or of a tree is not very different from computing it for a list, so the concept of a reduction carries over to other potential applications. Observing and classifying such regularities is an important area of research in parallel programming today. One overview and classification of skeletons is the Basel Algorithm Classification Scheme [Burkhart et al. 1993].
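Expressed as a higher-order function, the point is simply that one building block serves many operations. A minimal Haskell sketch (not drawn from any particular skeleton system, and specified here only sequentially on non-empty lists):

   -- Summing and taking the maximum are the same reduction skeleton with
   -- different operators; an implementer parallelizes reduce once per
   -- architecture.
   reduce :: (a -> a -> a) -> [a] -> a
   reduce = foldr1

   sumList :: Num a => [a] -> a
   sumList = reduce (+)

   maxList :: Ord a => [a] -> a
   maxList = reduce max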

For the time being, we restrict our attention to algorithmic skeletons, those that encapsulate control structures. The idea is that each skeleton corresponds to some standard algorithm or algorithm fragment, and that these skeletons can be composed sequentially. Software developers select the skeletons they want to use and put them together. The compiler or library writer chooses how each encapsulated algorithm is implemented and how intra- and inter-skeleton parallelism is exploited for each possible target architecture.


We briefly mention some of the most important algorithmic skeleton approaches. The Pisa Parallel Programming Language (P3L)5 uses a set of algorithmic skeletons that capture common parallel programming paradigms such as pipelines, worker farms, and reductions. For example, in P3L worker farms are modeled by means of the farm constructor as follows.

   farm P in (int data) out (int result)
     W in (data) out (result)
       result = f(data)
     end
   end farm

When the skeleton is executed, a number of workers W are executed in parallel with the two P processes (the emitter and the collector). Each worker executes the function f() on its data partition. Similar skeletons were developed by Cole [1989, 1990, 1992, 1994], who also computed cost measures for them on a parallel architecture. Work of a similar sort, using skeletons for reduce and map over pairs, pipelines, and farms, is also being done by Darlington's group at Imperial College [Darlington et al. 1993].
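Viewed abstractly, such skeletons are again higher-order functions. The following Haskell sketch is a hypothetical rendering, not P3L or Cole's notation; it gives farm and pipeline their obvious sequential meanings, which an implementation is free to parallelize.

   -- farm applies one worker function to every task; the emitter and
   -- collector of the P3L farm are hidden inside the implementation.
   farm :: (task -> result) -> [task] -> [result]
   farm worker = map worker

   -- pipeline chains stages; in a parallel implementation different stages
   -- can be processing different items at the same time.
   pipeline :: [a -> a] -> a -> a
   pipeline stages x = foldl (\v stage -> stage v) x stages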

Algorithmic skeletons are simple and abstract. However, because programs must be expressed as compositions of the skeletons provided, the expressiveness of the abstract programming language is an open question. None of the approaches previously described addresses this explicitly, nor is there any natural way in which to develop algorithmic skeleton programs, either from some higher-level abstraction or directly at the skeleton level. On the other hand, guaranteed performance for skeletons is possible if they are chosen with care, and because of this, cost measures can be provided.

4.1.3 Static and Communication-Limited Structure. Some skeleton approaches bound the amount of communication that takes place, usually because they incorporate awareness of geometric information.

One such model is homomorphic skeletons based on data types, an approach that developed from the Bird-Meertens formalism [Skillicorn 1994b]. The skeletons in this model are based on particular data types: one set for lists, one set for arrays, one set for trees, and so on. All homomorphisms on a data type can be expressed as an instance of a single recursive and highly parallel computation pattern, so that the arrangement of computation steps in an implementation needs to be done only once for each data type.

Consider the pattern of computation and communication shown in Figure 3. Any list homomorphism can be computed by appropriately substituting for f and g, where g must be associative.

5 See Baiardi et al. [1991] and Danelutto et al. [1991a,b; 1990].

Figure 3. A skeleton for computing arbitrary list homomorphisms.


For example,

   sum        f = id,  g = +
   maximum    f = id,  g = binary max
   length     f = K1,  g = +        (where K1 is the function that always returns 1)
   sort       f = id,  g = merge

Thus a template for scheduling the individual computations and communications can be reused to compute many different list homomorphisms by replacing the operations that are done as part of the template. Furthermore, this template can also be used to compute homomorphisms on bags (multisets), with slightly weaker conditions on the operations in the g slots: they must be commutative as well as associative.
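The template itself can be stated in one line. The following Haskell sketch is an illustration of the schema rather than Bird-Meertens notation; it shows the homomorphism skeleton and the four instances from the table above, defined here on non-empty lists.

   -- Apply f to every element, then combine with an associative g.
   hom :: (a -> b) -> (b -> b -> b) -> [a] -> b
   hom f g = foldr1 g . map f

   sumH :: Num a => [a] -> a
   sumH = hom id (+)

   maxH :: Ord a => [a] -> a
   maxH = hom id max

   lenH :: Num b => [a] -> b
   lenH = hom (const 1) (+)      -- const 1 plays the role of K1

   sortH :: Ord a => [a] -> [a]
   sortH = hom (: []) merge      -- elements injected as singletons, merged in order
     where
       merge xs [] = xs
       merge [] ys = ys
       merge (x:xs) (y:ys)
         | x <= y    = x : merge xs (y : ys)
         | otherwise = y : merge (x : xs) ys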

The communication required for such skeletons is deducible from the structure of the data type, so each implementation needs to construct an embedding of this communication pattern in the interconnection topology of each designated computer. Very often the communication requirements are mild; for example, it is easy to see that list homomorphisms require only the existence of a logarithmic-depth binary tree in the designated architecture's interconnection network. All communication can then take place with nearest neighbors (and hence in constant time). Homomorphic skeletons have been built for most of the standard types: sets and bags [Skillicorn 1994b], lists [Bird 1987; Malcolm 1990; Spivey 1989], trees [Gibbons 1991], arrays [Banger 1992, 1995], molecules [Skillicorn 1994a], and graphs [Singh 1993].

The homomorphic skeleton approach is simple and abstract, and the method of construction of data type homomorphisms automatically generates a rich environment for equational transformation. The communication pattern required for each type is known as the standard topology for that type. Implementations with guaranteed performance can be built for any designated computers into whose interconnection topologies the standard topology can be embedded. Because the complete schedule of computation and communication is determined in advance by the implementer, cost measures can be provided [Skillicorn and Cai 1995].

Cellular processing languages are based on the execution model of cellular automata. A cellular automaton consists of a possibly infinite n-dimensional lattice of cells. Each cell is connected to a limited set of adjacent cells. A cell has a state chosen from a finite alphabet. The state of a cellular automaton is completely specified by the values of the variables at each cell. The state of a single cell is a simple or structured variable that takes values in a finite set. The states of all cells in the lattice are updated simultaneously in discrete time steps. Cells update their values by using a transition function that takes as input the current state of the local cell and some limited collection of nearby cells that lie within some bounded distance, known as a neighborhood. Simple neighborhoods of a cell (C) in a 2-D lattice are

     N           N N N
   N C N         N C N
     N           N N N

Cellular processing languages, such as Cellang [Eckart 1992], CARPET [Spezzano and Talia 1997], CDL, and CEPROL [Seutter 1985], allow cellular algorithms to be described by defining the state of cells as a typed variable, or a record of typed variables, and a transition function containing the evolution rules of an automaton. Furthermore, they provide constructs for the definition of the pattern of the cell neighborhood. These languages implement a cellular automaton as an SIMD or SPMD program, depending on the designated architecture. In the SPMD (single program, multiple data) approach, cellular algorithms are implemented as a collection of medium-grain processes mapped onto different processing elements. Each process executes the same program (the transition function) on different data (the states of its cells). Thus all the processes obey, in parallel, the same local rule, which results in a global transformation of the whole automaton. Communication occurs only among neighboring cells, so the communication pattern is known statically. This allows scalable implementations with guaranteed performance, on both MIMD and SIMD parallel computers [Cannataro et al. 1995]. Moreover, cost measures can be provided.
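A single synchronous update step is easy to state. The following Haskell sketch is an illustration only, not Cellang or CARPET; it updates every cell of a finite 2-D grid from its own state and its four orthogonal neighbours, simply omitting neighbours that fall outside the grid.

   import Data.Array

   type Grid = Array (Int, Int) Int

   -- One synchronous step: every cell is rewritten by the transition
   -- function f applied to its current state and its neighbours' states.
   step :: (Int -> [Int] -> Int) -> Grid -> Grid
   step f g = listArray (bounds g) [ f (g ! c) (neighbours c) | c <- indices g ]
     where
       ((lo1, lo2), (hi1, hi2)) = bounds g
       neighbours (i, j) =
         [ g ! (i', j')
         | (i', j') <- [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
         , i' >= lo1, i' <= hi1, j' >= lo2, j' <= hi2 ]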

Another model that takes geometric arrangement explicitly into account is Crystal.6 Crystal is a functional language with added data types called index domains to represent geometry, that is, locality and arrangement. The feature distinguishing Crystal from other languages with geometric annotations is that index domains can be transformed and the transformations reflected in the computational part of programs. Crystal is simple and abstract, and possesses a transformation system based both on its functional semantics and on transformations of index domains. Index domains are a flexible way of incorporating target interconnection network topology into derivations, and Crystal provides a set of cost measures to guide such derivations. A more formal approach that is likely to lead to interesting developments in this area is Jay's [1995] shapely types.

The key to an abstract model is that the details of the implementation must somehow be implicit in the text of each program. We have seen two approaches. The first relies on the run-time system to find and exploit parallelism, and the second relies on predefined building blocks, knowledge of which is built into the compiler so that it can replace the abstract program operations by implementations, piece by piece.

4.2 Parallelism Explicit

The second major class of models is that in which parallelism is explicit in abstract programs, but software developers do not need to be explicit about how computations are to be divided into pieces and how those pieces are mapped to processors and communicate. The main strategies for implementing decomposition both depend on making decomposition and mapping computationally possible and effective. The first is to renounce temporal and spatial locality and assume low-cost context switches, so that the particular decomposition used does not affect performance much. Because any decomposition is effective, a simple algorithm can be used to compute it. The second is to use skeletons that have a natural mapping to designated processor topologies, skeletons based on the structure of the data that the program uses.

4.2.1 Dynamic Structure. Dataflow [Herath et al. 1987] expresses computations as operations, which may in principle be of any size but are usually small, with explicit inputs and results. The execution of these operations depends solely on their data dependencies: an operation is computed after all of its inputs have been computed, but this moment is determined only at run-time. Operations that do not have a mutual data dependency may be computed concurrently.

The operations of a dataflow program are considered to be connected by paths, expressing data dependencies, along which data values flow. They can be considered, therefore, as collections of first-order functions. Decomposition is implicit, since the compiler can divide the graph representing the computation in any way. The cut edges become the places where data move from one processor to another. Processors execute operations in an order that depends solely on those that are ready at any given moment. There is therefore no temporal context beyond the execution of each single operation, and hence no advantage to temporal locality. Because operations with a direct dependence are executed at widely different times, possibly even on different processors, there is no advantage to spatial locality either. As a result, decomposition has little direct effect on performance (although some caveats apply). Decomposition can be done automatically by decomposing programs into the smallest operations and then clustering to get pieces of appropriate size for the designated architecture's processors. Even random allocation of operations to processors performs well on many dataflow systems.

6 See Creveuil [1991] and Yang and Choo [1991; 1992a,b].

Communication is not made explicit in programs. Rather, the occurrence of a name as the result of an operation is associated, by the compiler, with all of those places where the name is the input of an operation. Because there are really no threads (threads contain only a single instruction), communication is effectively unsynchronized.

Dataflow languages have taken different approaches to expressing repetitive operations. Languages such as Id [Ekanadham 1991] and Sisal [McGraw et al. 1985; McGraw 1993; Skedzielewski 1991] are first-order functional (or single-assignment) languages. They have syntactic structures that look like loops but create a new context for each execution of the "loop body" (so that they seem like imperative languages, except that each variable name may be assigned to only once in each context). For example, a Sisal loop with single-assignment semantics can be written as follows.

   for i in 1, N
     x := A[i] + B[i]
   returns value of sum x
   end for

Sisal is the most abstract of the dataflow languages, because parallelism is only visible in the sense that its loops are expected to be opportunities for parallelism. In this example, all of the loop bodies could be scheduled simultaneously and then their results collected.
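Read functionally, the loop is a pointwise addition followed by a sum reduction; a rough Haskell equivalent (an illustration, not Sisal) makes the independence of the loop bodies explicit.

   -- Every A[i] + B[i] is independent of the others, so all of them can be
   -- evaluated in parallel before the final reduction.
   sumOfPairs :: Num a => [a] -> [a] -> a
   sumOfPairs as bs = sum (zipWith (+) as bs)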

Dataflow languages are abstract and simple, but they do not have a natural software development methodology. They can provide guaranteed performance; indeed, Sisal performs competitively with the best FORTRAN compilers on shared-memory architectures [McGraw 1993]. However, performance on distributed-memory architectures is still not competitive. Because so much scheduling is done dynamically at run-time, cost measures are not possible.

Explicit logic languages are those in which programmers must specify the parallelism explicitly [Shapiro 1989]. They are also called concurrent logic languages. Examples of languages in this class are PARLOG [Gregory 1987], Delta-Prolog [Pereira and Nasr 1984], Concurrent Prolog [Shapiro 1986], GHC [Ueda 1985], and Strand [Foster and Taylor 1990].

Concurrent logic languages can be viewed as a new interpretation of Horn clauses, the process interpretation. According to this interpretation, an atomic goal ← C can be viewed as a process, a conjunctive goal ← C1, . . . , Cn as a process network, and a logic variable shared between two subgoals can be viewed as a communication channel between two processes. The exploitation of parallelism is achieved through the enrichment of a logic language like Prolog with a set of mechanisms for the annotation of programs. One of these mechanisms, for instance, is the annotation of shared logical variables to ensure that they are instantiated by only one subgoal. For example, the model of parallelism utilized by the PARLOG and Concurrent Prolog languages is based on the CSP (communicating sequential processes) model. In particular, communication channels are implemented in PARLOG and Concurrent Prolog by means of logical variables shared between two subgoals (e.g., p(X,Y), q(Y,Z)). Both languages use the guard concept to handle nondeterminism in the same way as it is used in CSP to delay communication between parallel processes until a commitment is reached.


A program in a concurrent logic language is a finite set of guarded clauses,

   H ← G1, G2, . . . , Gn | B1, B2, . . . , Bm.      n, m ≥ 0,

where H is the clause head, the set Gi is the guard, and Bi is the body of the clause. Operationally, the guard is a test that must be successfully evaluated, together with the head unification, for the clause to be selected. The symbol |, the commit operator, is used as a conjunction between the guard and the body. If the guard is empty, the commit operator is omitted.

The declarative reading of a guarded clause is: H is true if both Gi and Bi are true. According to the process interpretation, to solve H it is necessary to solve the guard Gi, and if its resolution is successful, B1, B2, . . . , Bm are solved in parallel.

These languages require programmers to explicitly specify, using annotations, which clauses can be solved in parallel [Talia 1993]. For example, in PARLOG the '.' and ';' clause separators control the search for a candidate clause. Each group of '.'-separated clauses is tried in parallel. The clauses following a ';' are tried only if all the clauses that precede the ';' have been found to be noncandidate clauses. For instance, suppose that a relation is defined by a sequence of clauses

C1. C2; C3.

The clauses C1 and C2 are tested for candidacy in parallel, but the clause C3 is tested only if both C1 and C2 are found to be noncandidate clauses. Although concurrent logic languages extend the application areas of logic programming from artificial intelligence to system-level applications, program annotations require a different style of programming. They weaken the declarative nature of logic programming by making the exploitation of parallelism the responsibility of the programmer. From the point of view of our classification, concurrent logic programs such as PARLOG programs define parallel execution explicitly by means of annotations. Furthermore, PARLOG allows dynamic creation of threads to solve subgoals but, because of the irregular structure of programs, it does not limit communication among threads.

Another symbolic programming language in which parallelism is made explicit by the programmer is Multilisp. The Multilisp [Halstead 1986] language is an extension of Lisp in which opportunities for parallelism are created using futures. In the language implementation, there is a one-to-one correspondence between threads and futures. The expression (future X) returns a suspension for the value of X and immediately creates a new process to evaluate X, allowing parallelism between the process computing a value and the process using that value. When the value of X is computed, the value replaces the future. Futures give a model that represents partially computed values; this is especially significant in symbolic processing, where operations on structured data occur often. An attempt to use the result of a future suspends until the value has been computed. Futures are first-class objects and can be passed around regardless of their internal status. The future construct creates a computation style much like that found in the dataflow model. In fact, futures allow eager evaluation in a controlled way that fits between the fine-grained eager evaluation of dataflow and the laziness of higher-order functional languages. By using futures we have dynamic thread creation, and hence programs with dynamic structure; moreover, the communication of values among expressions is not limited.
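The essence of a future is small enough to sketch directly. The following Haskell fragment is a toy built only from the standard Control.Concurrent primitives, with Multilisp's implicit touch made explicit: future starts computing a value in a new thread, and touch blocks until the value is ready.

   import Control.Concurrent (forkIO)
   import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, readMVar)

   newtype Future a = Future (MVar a)

   -- Start evaluating the action in a new thread and return immediately.
   future :: IO a -> IO (Future a)
   future act = do
     box <- newEmptyMVar
     _ <- forkIO (act >>= putMVar box)
     return (Future box)

   -- Using the value suspends until it has been computed.
   touch :: Future a -> IO a
   touch (Future box) = readMVar box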

4.2.2 Static Structure. Turning to models with static structure, the skeleton concept appears once more, but this time based around single data structures. At first glance it would seem that monolithic operations on objects of a data type, doing something to every item of a list or array, would yield a programming model of limited expressiveness. However, it turns out to be a powerful way of describing many interesting algorithms.

Data parallelism arose historically from the attempt to use computational pipelines. Algorithms were analyzed for situations in which the same operation was applied repeatedly to different data and the separate applications did not interact. Such situations exploit vector processors to dramatically reduce the control overhead of the repetition, since pipeline stalls are guaranteed not to occur because of the independence of the steps. With the development of SIMD computers, it was quickly realized that vectorizable code is also SIMD code, except that the independent computations proceed simultaneously instead of sequentially. SIMD code can be efficiently executed on MIMD computers as well, so vectorizable code situations can be usefully exploited by a wide range of parallel computers.

Such code often involves arrays and can be seen more abstractly as instances of maps, the application of a function to each element of a data structure. Having made this abstraction, it is interesting to ask what other operations might be useful to consider as applied monolithically to a data structure. Thus data parallelism is a general approach in which programs are compositions of such monolithic operations, applied to objects of a data type and producing results of that same type.

We distinguish two approaches to describing such parallelism: one based on (parallel) loops and the other based on monolithic operations on data types.

Consider FORTRAN with the addition of a ForAll loop, in which iterations of the loop body are conceptually independent and can be executed concurrently. For example, a ForAll statement such as

   ForAll (I = 1:N, J = 1:M)
     A(I,J) = I * B(J)

on a parallel computer can be executed in parallel. Care must be taken to ensure that the loops do not reference the same locations; for example, by indexing the same element of an array via a different index expression. This cannot be checked automatically in general, so most FORTRAN dialects of this kind place the responsibility on the programmer to make the check. Such loops are maps, although not always over a single data object.

Many FORTRAN dialects such as FORTRAN-D [Tseng 1993] and High Performance Fortran (HPF) [High Performance FORTRAN Language Specification 1993; Steele 1993] start from this kind of parallelism and add more direct data parallelism by including constructs for specifying how data structures are to be allocated to processors, and operations to carry out other data-parallel operations, such as reductions. For example, HPF, a parallel language based on FORTRAN-90, FORTRAN D, and SIMD FORTRAN, includes the Align directive to specify that certain data are to be distributed in the same way as certain other data. For instance

!HPF$ Align X (:,:) with D (:,:)

aligns X with D, that is, ensures that elements of X and D with the same indices are placed on the same processor. Furthermore, the Distribute directive specifies a mapping of data to processors; for example,

!HPF$ Distribute D2 (Block, Block)

specifies that the processors are to be considered a two-dimensional array and the points of D2 are to be associated with processors in this array in a blocked fashion. HPF also offers a directive to inform the compiler that operations in a loop can be executed independently (in parallel). For example, the following code asserts that A and B do not share memory space.

   !HPF$ Independent
   Do I = 1, 1000
     A(I) = B(I)
   end Do


Other related languages are Pandore II [Andre et al. 1994; Andre and Thomas 1989; Jerid et al. 1994] and C** [Larus et al. 1992]. This work is beginning to converge with skeleton approaches: for example, Darlington's group has developed a FORTRAN extension that uses skeletons [Darlington et al. 1995]. Another similar approach is the latest language in the Modula family, Modula 3* [Heinz 1993]. Modula 3* supports forall-style loops over data types in which each loop body executes independently and the loop itself ends with a barrier synchronization. It is compiled to an intermediate language that is similar in functionality to HPF.

Data-parallel languages based on data types other than arrays have also been developed. Some examples are: parallel SETL [Flynn-Hummel and Kelly 1993; Hummel et al. 1991], parallel sets [Kilian 1992a,b], match and move [Sheffler 1992], Gamma [Banatre and LeMetayer 1991; Creveuil 1991; Mussat 1991], and PEI [Violard 1994]. Parallel SETL is an imperative-looking language with data-parallel operations on bags. For example, the inner statement of a matrix multiplication looks like

   c(i,j) := +/{a(i,k) * b(k,j) : k over {1..n}}.

The outer set of braces is a bag comprehension, generating a bag containing the values of the expression before the colon for all values of k implied by the text after the colon. The +/ is a bag reduction or fold, summing these values. Gamma is a language with data-parallel operations on finite sets. For example, the code to find the maximum element of a set is

   max M := x:M, y:M → x:M ⇐ x ≥ y

which specifies that any pair of elements x and y may be replaced in the set by the element x, provided the value in x is larger than the value in y.
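The execution model can be paraphrased as repeated reaction on a multiset. The following sequential Haskell sketch is only an illustration, not Gamma syntax; it applies one reaction at a time, whereas a parallel implementation may carry out disjoint reactions simultaneously.

   import Data.Maybe (listToMaybe)

   -- Repeatedly replace some pair satisfying cond by the products of
   -- action, stopping when no pair reacts.
   gamma :: (a -> a -> Bool) -> (a -> a -> [a]) -> [a] -> [a]
   gamma cond action = go
     where
       go xs = maybe xs go (react xs)
       react xs = listToMaybe
         [ action x y ++ rest
         | (x, rest1) <- picks xs
         , (y, rest)  <- picks rest1
         , cond x y ]
       picks []     = []
       picks (z:zs) = (z, zs) : [ (w, z : ws) | (w, ws) <- picks zs ]

   -- The maximum-element program from the text: x, y -> x whenever x >= y.
   maxGamma :: Ord a => [a] -> [a]
   maxGamma = gamma (>=) (\x _ -> [x])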

There are also models that are based on arrays but derive from APL rather than FORTRAN. These include Mathematics of Arrays (MOA) [Mullin 1988], and Nial and Array Theory [More 1986].

Data-parallel languages simplify programming because operations that require loops in lower-level parallel languages can be written as single operations (which are also more revealing to the compiler, since it does not have to try to infer the pattern intended by the programmer). With a sufficiently careful choice of data-parallel operations, some program transformation capability is often achieved. The natural mapping of data-parallel operations to architectures, at least for simple types, makes guaranteed performance, and also cost measures, possible.

4.2.3 Static and Communication-Limited Structure. The data-parallel languages of the previous section were developed primarily with program construction in mind. There is another set of similar languages whose inspiration was primarily architectural features. Because of these origins, they typically pay more attention to the amount of communication that takes place in computing each operation. Thus their thread structures are fixed, communication takes place at well-defined points, and the size of data involved in communications is small and fixed.

A wide variety of languages was developed whose basic operations were data-parallel list operations, inspired by the architecture of the Connection Machine 2. These often included a map operation, some form of reduction, perhaps using only a fixed set of operators, and later scans (parallel prefixes) and permutation operations. In approximately chronological order, these models are: scan [Blelloch 1987], multiprefix [Ranade 1989], paralations [Goldman 1989; Sabot 1989], the C* data-parallel language [Hatcher and Quinn 1991; Quinn and Hatcher 1990], the scan-vector model and NESL [Blelloch 1990, 1993, 1996; Blelloch and Sabot 1988, 1990; Blelloch and Greiner 1994], and CamlFlight [Hains and Foisy 1993]. As for other data-parallel languages, these models are simple and fairly abstract. For instance, C* is an extension of the C language that incorporates features of the SIMD parallel model. In C*, data parallelism is implemented by defining data of a parallel kind. C* programs map variables of a particular data type, defined as parallel by the keyword poly, to separate processing elements. In this way, each processing element executes, in parallel, the same statement for each instance of the specified data type. At least on some architectures, data-parallel languages provide implementations with guaranteed performance and have accurate cost measures. Their weakness is that the choice of operations is made on the basis of what can be efficiently implemented, so that there is no basis for a formal software development methodology.
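The scan (parallel prefix) operation that several of these models treat as primitive has a one-line sequential specification; in Haskell (an illustration, not drawn from any of the languages above):

   -- Inclusive prefix sums: a data-parallel implementation computes the
   -- same result in logarithmic depth rather than left to right.
   prefixSums :: Num a => [a] -> [a]
   prefixSums = scanl1 (+)

   -- prefixSums [1, 2, 3, 4] == [1, 3, 6, 10]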

4.3 Decomposition Explicit

Models of this kind require abstract programs to specify the pieces into which they are to be divided, but the placement of these pieces on processors and the way in which they communicate need not be described so explicitly.

4.3.1 Static Structure. The only examples in this class are those that renounce locality, which ensures that placement does not matter to performance.

Bulk synchronous parallelism (BSP)7 is a model in which interconnection network properties are captured by a few architectural parameters. A BSP abstract machine consists of a collection of p abstract processors, each with local memory, connected by an interconnection network whose only properties of interest are the time to do a barrier synchronization (l) and the rate at which continuous randomly addressed data can be delivered (g). These BSP parameters are determined experimentally for each parallel computer.

A BSP (abstract) program consists of p threads and is divided into supersteps. Each superstep consists of: a computation in each processor, using only locally held values; a global message transmission from each processor to any set of the others; and a barrier synchronization. At the end of a superstep, the results of global communications become visible in each processor's local environment. A superstep is shown in Figure 4. If the maximum local computation on a step takes time w and the maximum number of values sent by or received by any processor is h, then the total time for a superstep is given by

   t = w + hg + l

(where g and l are the network parameters), so that it is easy to determine the cost of a program. This time bound depends on randomizing the placement of threads and using randomized or adaptive routing to bound communication time.

7 See McColl [1994a,b; 1993a], Skillicorn et al. [1996], and Valiant [1989; 1990a,b].

Figure 4. A BSP superstep.

Thus BSP programs must be decomposed into threads, but the placement of threads is then done automatically. Communication is implied by the placement of threads, and synchronization takes place across the whole program. The model is simple and fairly abstract, but lacks a software construction methodology. The cost measures give the real cost of a program on any architecture for which g and l are known.
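For example, on a machine with hypothetical parameters g = 4 time units per word and l = 200 time units, a superstep in which each processor performs at most w = 10,000 units of local computation and sends or receives at most h = 500 words costs t = 10,000 + 500 × 4 + 200 = 12,200 time units.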

The current implementation of BSP uses an SPMD library that can be called from C and FORTRAN. The library provides operations to put data into the local memory of a remote process, to get data from a remote process, and to synchronize. We illustrate with a small program to compute prefix sums:

   int prefixsums(int x) {
     int i, left, right;
     bsp_pushregister(&left, sizeof(int));   /* make left addressable remotely */
     bsp_sync();
     right = x;
     for (i = 1; i < bsp_nprocs(); i *= 2) {
       if (bsp_pid() + i < bsp_nprocs())
         bsp_put(bsp_pid() + i, &right, &left, 0, sizeof(int));
       bsp_sync();                           /* end of superstep */
       if (bsp_pid() >= i) right = left + right;
     }
     bsp_popregister(&left);
     return right;
   }

The bsp_pushregister and bsp_popregister calls are needed so that each process can refer to variables in remote processes by name, even though they might have been allocated in heap or stack storage.

Another related approach is LogP [Culler et al. 1993], which uses similar threads with local contexts, updated by global communications. However, LogP does not have an overall barrier synchronization. The LogP model is intended as an abstract model that can capture the technological reality of parallel computation. LogP models parallel computations using four parameters: the latency (L), overhead (o), and bandwidth (g) of communication, and the number of processors (P). A set of programming examples has been designed with the LogP model and implemented on the CM-5 parallel machine to evaluate the model's usefulness. However, the LogP model is no more powerful than BSP [Bilardi et al. 1996], so BSP's simpler style is perhaps to be preferred.

4.4 Mapping Explicit

Models in this class require abstract programs to specify how programs are decomposed into pieces and how these pieces are placed, but they provide some abstraction for the communication actions among the pieces. The hardest part about describing communication is the necessity of labeling the two ends of each communication action to say that they belong together, and of ensuring that communication actions are properly matched. Given the number of communications in a large parallel program, this is a tedious burden for software developers. All of the models in this class try to reduce this burden by decoupling the ends of the communication from each other, by providing higher-level abstractions for patterns of communication, or by providing better ways of specifying communication.

4.4.1 Dynamic Structure. Coordination languages simplify communication by separating the computation aspects of programs from their communication aspects and providing a separate language in which to specify communication. This separation makes the computation and communication orthogonal to each other, so that a particular coordination style can be applied to any sequential language.

The best known example is Linda [Ahuja et al. 1994; Carriero 1987; Carriero and Gelernter 1988, 1993], which replaces point-to-point communication with a large shared pool into which data values are placed by processes and from which they are retrieved associatively. This shared pool is known as a tuple space. The Linda communication model contains three communication operations: in, which removes a tuple from tuple space, based on its arity and the values of some of its fields, filling in the remaining fields from the retrieved tuple; read (rd), which does the same except that it copies the tuple from tuple space; and out, which places a tuple in tuple space. For example, the read operation

rd(“Canada”, ?X, “USA”)

searches the tuple space for tuples of three elements, first element “Canada,” last element “USA,” and middle element of the same type as variable X. Besides these three basic operations, Linda provides the eval(t) operation that implicitly creates a new process to evaluate the tuple and insert it in the tuple space.

The Linda operations decouple the send and receive parts of a communication: the “sending” thread does not know the “receiving” thread, not even whether it exists. Although the model for finding tuples is associative matching, implementations typically compile this away, based on patterns visible at compile time. The Linda model requires programmers to manage the threads of a program, but reduces the burden imposed by managing communication. Unfortunately, a tuple space cannot necessarily be implemented with guaranteed performance, so the model cannot provide cost measures; worse, Linda programs can deadlock. Another important issue is a software development methodology. To address this issue a high-level programming environment, called the Linda Program Builder (LPB), has been implemented to support the design and development of Linda programs [Ahmed 1994]. The LPB environment guides a user through program design, coding, monitoring, and execution of Linda software.

Nonmessage communication languages reduce the overheads of managing communication by disguising communication in ways that fit more naturally into threads. For example, ALMS [Arbeit et al. 1993; Peierls and Campbell 1991] treats message passing as if the communication channels were memory-mapped. Reference to certain message variables in different threads behaves like a message transfer from one to the others. PCN [Foster et al. 1992, 1991; Foster and Tuecke 1991] and Compositional C++ also hide communication by single-use variables. An attempt to read from one of these variables blocks the thread if a value has not already been placed in it by another thread. These approaches are similar to the use of full/empty bits on variables, an old idea coming back to prominence in multithreaded architectures.

In particular, the PCN (program composition notation) language is based on two simple concepts: concurrent composition and single-assignment variables. In PCN, single-assignment variables are called definitional variables. Concurrent composition allows parallel execution of statement blocks to be specified without specifying, in the concurrent composition, how each of the pieces of the composition is mapped to processors. Processes that share a definitional variable can communicate with each other through it. For instance, in the parallel composition

   {|| producer(X), consumer(X)}

the two processes producer and consumer can use X to communicate, regardless of their location on the parallel computer. In PCN, the mapping of processes to processors is specified by the programmer, either by annotating programs with location functions or by defining mappings of virtual to physical topologies.

The logical extension of mapping communication to memory is virtual shared memory, in which the abstraction provided to the program is of a single, shared address space, regardless of the real arrangement of memory. This requires remote memory references either to be compiled into messages or to be effected by messages at run-time. So far, results have not suggested that this approach is scalable, but it is an ongoing research area [Chin and McColl 1994; Li and Hudak 1989; Raina 1990; Wilkinson et al. 1992].

Annotated functional languages make the compiler’s job easier by allowing programmers to provide extra information about suitable ways to partition the computation into pieces and place them [Kelly 1989]. The same reduction rules apply, so that the communication and synchronization induced by this placement follow in the same way as in pure graph reduction.

An example of this kind of language is Paralf [Hudak 1986]. Paralf is a functional language based on lazy evaluation; that is, an expression is evaluated on demand. However, Paralf allows a user to control the evaluation order by explicit annotations. In Paralf, communication and synchronization are implicit, but it provides a mapping notation to specify which expressions are to be evaluated on which processor. An expression followed by the annotation $on proc is evaluated on the processor identified by proc. For example, the expression

(f(x) $on ($self+1)) * (h(x) $on ($self))

denotes the computation of the f(x) subexpression on a neighbor processor in parallel with the execution of h(x).

The remote procedure call (RPC) mechanism is an extension of the traditional procedure call. An RPC is a procedure call between two different processes, the caller and the receiver. When a process calls a remote procedure on another process, the receiver executes the code of the procedure and passes back to the caller the output parameters. Like rendezvous, RPC is a synchronous cooperation form. During the execution of the procedure, the caller is blocked and is reactivated by the arrival of the output parameters. Full synchronization of RPC might limit the exploitation of a high degree of parallelism among the processes that compose a concurrent program. In fact, when a process P calls a remote procedure r of a process T, the caller process P remains idle until the execution of r terminates, even if P could execute some other operation during the execution of r. To partially limit this effect, most new RPC-based systems use lightweight threads. Languages based on the remote procedure call mechanism are DP [Hansen 1978], Cedar [Swinehart et al. 1985], and Concurrent CLU [Cooper and Hamilton 1988].
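The remark about lightweight threads can be illustrated with a short C sketch: the caller hands the blocking call to a helper thread so that it can keep computing. The rpc_square stub and the thread wrapper are hypothetical stand-ins for whatever stubs a real RPC system would generate.

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical synchronous RPC stub: marshals its argument, sends it to
       the remote process, and blocks until the output parameter returns.   */
    static int rpc_square(int x) {
        return x * x;                  /* stand-in for the remote result */
    }

    struct call { int arg; int result; };

    static void *do_call(void *p) {
        struct call *c = p;
        c->result = rpc_square(c->arg);   /* only this lightweight thread blocks */
        return NULL;
    }

    int main(void) {
        struct call c = { 7, 0 };
        pthread_t t;
        pthread_create(&t, NULL, do_call, &c);  /* issue the RPC asynchronously */
        /* the caller can do useful local work here instead of idling */
        pthread_join(t, NULL);                  /* synchronize when the result is needed */
        printf("remote result: %d\n", c.result);
        return 0;
    }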

4.4.2 Static Structure. Graphical languages simplify the description of communication by allowing it to be inserted graphically and at a higher, structured level. For example, the language Enterprise [Lobe et al. 1992; Szafron et al. 1991] classifies program units by type and generates some of the communication structure automatically based on type. The metaphor is of an office, with some program units communicating only through a “secretary,” for example. Parsec [Feldcamp and Wagner 1993] allows program units to be connected using a set of predefined connection patterns. Code [Newton and Browne 1992] is a high-level dataflow language in which computations are connected together graphically and a firing rule and result-passing rule are associated with each computation. Decomposition in these models is still explicit, but communication is both more visible and simpler to describe. The particular communication patterns available are chosen for applicability reasons rather than efficiency, so performance is not guaranteed, nor are cost measures.

Coordination languages with contexts extend the Linda idea. One of the weaknesses of Linda is that it provides a single global tuple space and thus prevents modular development of software. A model that extends Linda by including ideas from Occam is the language Ease [Zenith 1991a,b,c, 1992]. Ease programs have multiple tuple spaces, which are called contexts and may be visible to only some threads. Because those threads that access a particular context are known, contexts take on some of the properties of Occam-like channels. Threads read and write data to contexts as if they were Linda tuple spaces, with associative matching for reads and inputs. However, they can also use a second set of primitives that move data to a context and relinquish ownership of the data, or retrieve data from a context and remove them from the context. Such operations can use pass-by-reference since they guarantee that the data will be referenced by only one thread at a time. Ease has many of the same properties as Linda, but makes it easier to build implementations with guaranteed performance. Ease also helps with decomposition by allowing process structuring in the style of Occam.

Another related language is ISETL-Linda [Douglas et al. 1995], which is an extension to the SETL paradigm of computing with sets as aggregates. It adds Linda-style tuple spaces as a data type and treats them as first-class objects. To put it another way, ISETL-Linda resembles a data-parallel language in which bags are a data type and associative matching is a selection operation on bags. Thus ISETL-Linda can be seen as extending SETL-like languages with a new data type or as extending Linda-like languages with skeletons.

A language of the same kind derived from FORTRAN is Opus [Mehrotra and Haines 1994], which has both task and data parallelism, but communication is mediated by shared data abstractions. These are autonomous objects that are visible to any subset of tasks but are internally sequential; that is, only one method within each object is active at a time. They are a kind of generalization of monitors.

4.4.3 Static and Communication-Limited Structure. Communication skeletons extend the idea of prestructured building blocks to communication [Skillicorn 1996]. A communication skeleton is an interleaving of computation steps, which consist of independent local computations, and communication steps, which consist of fixed patterns of communication in an abstract topology. These patterns are collections of edge-disjoint paths in an abstract topology, each of which functions as a broadcast channel. Figure 5 shows a communication skeleton using two computation steps, interleaved with two different communication patterns. This model is a blend of ideas from BSP and from algorithmic skeletons, together with concepts such as adaptive routing and broadcast that are supported by new architectural designs. The model is moderately architecture-independent because communication skeletons can be built assuming a weak topology target and then embedding results can be used to build implementations for targets with richer interconnection topologies. It can provide guaranteed performance and does have cost measures.

Figure 5. A communication skeleton.

4.5 Communication Explicit

Models in this class require communication to be explicit, but reduce some of the burden of synchronization associated with it. Usually this is done by having an asynchronous semantics: messages are delivered but the sender cannot depend on the time of delivery, and delivery of multiple messages may be out of order.

4.5.1 Dynamic Structure. Process nets resemble dataflow in the sense that operations are independent entities that respond to the arrival of data by computing and possibly sending on other data. The primary differences are that the operations individually decide what their response to data arrival will be, and individually decide to change their behavior. They therefore lack the global state that exists, at least implicitly, in dataflow computations.

The most important model in this class is actors [Agha 1986; Baude 1991; Baude and Vidal-Naquet 1991]. Actor systems consist of collections of objects called actors, each of which has an incoming message queue. An actor repeatedly executes the sequence: read the next incoming message, send messages to other actors whose identity it knows, and define a new behavior that governs its response to the next message. Names of actors are first-class objects and may be passed around in messages. Messages are delivered asynchronously and unordered. However, neither guaranteed performance nor cost measures for actors is possible because the total communication in an actor program may not be bounded. Moreover, because the Actor model is highly distributed, compilers must serialize execution to achieve execution efficiency on conventional processors. One of the most effective compiler transformations is to eliminate creation of some types of actors and to change messages sent to actors on the same processor into function calls [Kim and Agha 1995]. Thus the actual cost of executing an actor program is indeterminate without a specification of how the actors are mapped to processors and threads.
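A rough C sketch of the actor loop just described is shown below. The mailbox, message format, and behaviour representation are all hypothetical; an actor language generates this machinery and also distributes actors across processors, which the sketch ignores.

    typedef struct actor actor_t;
    typedef struct { int tag; int payload; } message_t;

    /* A behaviour consumes one message and installs the actor's next
       behaviour by assigning self->behave (the "become" step).          */
    typedef void (*behaviour_fn)(actor_t *self, message_t m);

    struct actor {
        message_t    queue[64];     /* incoming message queue (bounded here) */
        int          head, tail;
        behaviour_fn behave;        /* current behaviour */
    };

    /* Asynchronous send: the sender never waits (ordering is not guaranteed
       in the model, although this toy queue happens to preserve it).       */
    void actor_send(actor_t *to, message_t m) {
        to->queue[to->tail++ % 64] = m;
    }

    static void idle(actor_t *self, message_t m) { (void)self; (void)m; }

    /* A counting actor: accumulates payloads until it receives a message
       with a nonzero tag, at which point it becomes idle.                  */
    static void counting(actor_t *self, message_t m) {
        static int total = 0;
        if (m.tag == 0) total += m.payload;
        else self->behave = idle;            /* "become" a different behaviour */
    }

    /* The driving loop an implementation would run for each actor. */
    void actor_run(actor_t *a) {
        while (a->head != a->tail) {
            message_t m = a->queue[a->head++ % 64];
            a->behave(a, m);                 /* process the next message */
        }
    }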

A different kind of process net is provided by the language Darwin [Eisenbach and Patterson 1993; Radestock and Eisenbach 1994], which is based on the π-calculus. The language provides a semantically well founded configuration subset for specifying how ordinary processes are connected and how they communicate. It is more like a process net than a configuration language since the binding of the semantics of communication to connections is dynamic.

One of the weaknesses of the actor model is that each actor processes its message queue sequentially and this can lead to bottlenecks. Two extensions of the model that address this issue have been proposed: Concurrent Aggregates [Chien 1991; Chien and Dally 1990] and ActorSpace [Agha and Callsen 1993]. Concurrent Aggregates (CA) is an object-oriented language well suited to exploit parallelism on fine-grain massively parallel computers. In it, unnecessary sources of serialization have been avoided. An aggregate in CA is a homogeneous collection of objects (called representatives) that are grouped together and referenced by a single aggregate name. Each aggregate is multi-access, so it may receive several messages simultaneously, unlike other object-oriented languages such as the actor model and ABCL/1. Concurrent Aggregates incorporates many other innovative features such as delegation, intra-aggregate addressing, first-class messages, and user continuations. Delegation allows the behavior of an aggregate to be constructed incrementally from that of many other aggregates. Intra-aggregate addressing makes cooperation among parts of an aggregate possible.

The ActorSpace model extends the actor model to avoid unnecessary serializations. An actor space is a computationally passive container of actors that acts as a context for matching patterns. In fact, the ActorSpace model uses a communication model based on destination patterns. Patterns are matched against listed attributes of actors and actor spaces that are visible in the actor space. Messages can be sent to one arbitrary member of a group or broadcast to all members of a group defined by a pattern.

We now turn to external OO models. Actors are regarded as existing regardless of their communication status. A superficially similar approach, but one which is quite different underneath, is to extend sequential object-oriented languages so that more than one thread is active at a time. The first way to do this, which we have called external object-orientation, is to allow multiple threads of control at the highest level of the language. Objects retain their traditional role of collecting code that logically belongs together. Object state can now act as a communication mechanism since it can be altered by a method executed by one thread and observed by a method executed as part of another thread. The second approach, which we call internal object orientation, encapsulates parallelism within the methods of an object, but the top level of the language appears sequential. It is thus closely related to data-parallelism. We return to this second case later; here we concentrate on external OO models and languages.

Some interesting external object-based models are ABCL/1 [Yonezawa et al. 1987], ABCL/R [Watanabe et al. 1988], POOL-T [America 1987], EPL [Black et al. 1984], Emerald [Black et al. 1987], and Concurrent Smalltalk [Yokote et al. 1987]. In these languages, parallelism is based on assigning a thread to each object and using asynchronous message passing to reduce blocking. EPL is an object-based language that influenced the design of Emerald. In Emerald, all entities are objects that can be passive (data) or active. Each object consists of four parts: a name, a representation (data), a set of operations, and an optional process that can run in parallel with invocations of object operations. Active objects in Emerald can be moved from one processor to another. Such a move can be initiated by the compiler or by the programmer using simple language constructs. The primary design principles of ABCL/1 (an object-based concurrent language) are practicality and clear semantics of message passing. Three types of message passing are defined: past, now, and future. The now mode operates synchronously, whereas the past and future modes operate asynchronously. For each of the three message-passing mechanisms, ABCL/1 provides two distinct modes, ordinary and express, which correspond to two different message queues. To give an example, past type message passing in ordinary and express modes is, respectively,

[Obj <= msg] and [Obj <<= msg],

where Obj is the receiver object and msg is the sent message.

In ABCL/1, independent objects can execute in parallel but, as in the actor model, messages are processed serially within an object. Although message passing in ABCL/1 programs may take place concurrently, no more than one message can arrive at the same object simultaneously. This limits the parallelism among objects. An extension of ABCL/1 is ABCL/R, which introduces reflection.

Objects and processes exploit parallelism in external object-oriented languages in two principal ways: using the objects as the unit of parallelism by assigning one or more processes to each object, or defining processes as components of the language. In the first approach, languages are based on active objects and each process is bound to a particular object for which it is created. In the latter approach, two different kinds of entities are defined, objects and processes. A process is not bound to a single object, but is used to perform all the operations required to satisfy an action. Therefore, a process can execute within many objects, changing its address space when an invocation to another object is made. Whereas the object-oriented models discussed before use the first approach, systems like Argus [Liskov 1987] and Presto [Bershad et al. 1988] use the second approach. In this case, languages provide mechanisms for creating and controlling multiple processes external to the object structure.

Argus supports coarse-grain and medium-grain objects and dynamic process creation. In Argus, guardians contain data objects and procedures. A guardian instance is created dynamically by a call to a creator procedure and can be explicitly mapped to a processor:

guardianType$creator(parameters) processor X.

The expense of dynamic process creation is reduced by maintaining a pool of unused processes. A new group of processes is created only when the pool is emptied. In these models, parallelism is implemented on top of the object organization and explicit constructs are defined to ensure object integrity. It is worth noticing that these models were developed for programming coarse-grain programs in distributed systems, not tightly coupled, fine-grain parallel machines.

Active messages is an approach that decouples communication and synchronization by treating messages as active objects rather than passive data. Essentially a message consists of two parts: a data part and a code part that executes on the receiving processor when the message has been transmitted. Thus a message changes into a process when it arrives at its destination. There is therefore no synchronization with any process at the receiving end, and hence a message “send” does not have a corresponding “receive.” This approach is used in the Movie system [Faigle et al. 1993] and the language environments for the J-machine [Chien and Dally 1990; Dally et al. 1992; Noakes and Dally 1990].
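The idea can be pictured with a small C sketch: a message carries an index into a handler table known to every processor, and the receiver simply runs the handler on the message data; no matching receive is ever posted. The handler table and message layout are invented for illustration, and real systems such as the J-machine environments differ in detail.

    #include <stdio.h>

    /* An active message: a handler identifier plus its argument data.
       Indices travel in the message rather than raw code pointers.      */
    typedef void (*am_handler)(int arg);
    typedef struct { int handler_index; int arg; } am_t;

    static void deposit(int arg)   { printf("deposit %d\n", arg); }
    static void incr_flag(int arg) { printf("flag += %d\n", arg); }

    /* Handler table assumed identical on every processor. */
    static am_handler handlers[] = { deposit, incr_flag };

    /* What the receiving side does when a message arrives: the handler
       simply runs to completion in the arrival context.                 */
    void am_dispatch(am_t msg) {
        handlers[msg.handler_index](msg.arg);
    }

    int main(void) {
        am_t msg = { 1, 42 };      /* as if it had just arrived off the network */
        am_dispatch(msg);
        return 0;
    }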

4.5.2 Static Structure. Internal object-oriented languages are those in which parallelism occurs within single methods. The Mentat Programming Language (MPL) is a parallel object-oriented system designed for developing architecture-independent parallel applications. The Mentat system integrates a data-driven computation model with the object-oriented paradigm. The data-driven model supports a high degree of parallelism, whereas the object-oriented paradigm hides much of the parallel environment from the user. MPL is an extension of C++ that supports both intra- and interobject parallelism. The compiler and the run-time support of the language are designed to achieve high performance. The language constructs are mapped to the macrodataflow model that is the computation model underlying Mentat, which is a medium-grain data-driven model in which programs are represented as directed graphs. The vertices of the program graphs are computation elements that perform some function. The edges model data dependencies between the computation elements. The compiler generates code to construct and execute data dependency graphs. Thus interobject parallelism in Mentat is largely transparent to the programmer. For example, suppose that A, B, C, D, E, and M are vectors and consider the statements:

A = vect_op.add(B, C);
M = vect_op.add(A, vect_op.add(D, E));

The Mentat compiler and run-time system detect that the two additions (B + C) and (D + E) are not data-dependent on one another and can be executed in parallel. Then the result is automatically forwarded to the final addition. That result is forwarded to the caller and associated with M. In this approach, the programmer makes granularity and partitioning decisions using Mentat class definition constructs, and the compiler and the run-time support manage communication and synchronization [Grimshaw 1993a,b, 1991; Grimshaw et al. 1991a,b].

4.5.3 Static and Communication-Limited Structure. Systolic arrays are gridlike architectures of processing elements or cells that process data in an n-dimensional pipelined fashion. By analogy with the systolic dynamics of the heart, systolic computers perform operations in a rhythmic, incremental, and repetitive manner [Kung 1982] and pass data to neighbor cells along one or more directions. In particular, each computing element computes an incremental result and the systolic computer derives the final result by interpreting the incremental results from the entire array. A parallel program for a systolic array must specify how data are mapped onto the systolic elements and how the data flow through the elements. High-level programmable arrays allow the development of systolic algorithms by the definition of inter- and intracell parallelism and cell-to-cell data communication. Clearly, the principle of rhythmic communication distinguishes systolic arrays from other parallel computers. However, even if high-level programmability of systolic arrays creates a more flexible systolic architecture, penalties can occur because of complexity and possible slowing of execution due to the problem of data availability. High-level programming models are necessary for promoting widespread use of programmable systolic arrays. One example is the language Alpha [de Dinechin et al. 1995], where programs are expressed as recurrence equations. These are transformed into systolic form by regarding the data dependencies as defining an affine or vector space that can be geometrically transformed.
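To show the rhythmic, incremental style concretely, the following C program simulates a one-dimensional systolic array computing a matrix-vector product: each cell holds one element of the vector, and partial sums pulse one cell to the right per clock tick while the matrix is fed in skewed. The sizes, data, and the sequential simulation itself are illustrative only and do not describe any particular systolic machine.

    #include <stdio.h>

    #define M 4   /* rows of A: number of results y[i]            */
    #define N 3   /* columns of A: number of cells in the array   */

    int main(void) {
        double A[M][N] = {{1,2,3},{4,5,6},{7,8,9},{1,0,1}};
        double x[N]    = {1, 1, 2};
        double y[M]    = {0};

        /* partial[j] models the register between cell j-1 and cell j.
           Cell j holds x[j]; A[i][j] is arranged to reach cell j exactly
           when the partial sum for y[i] does.                            */
        double partial[N + 1] = {0};

        for (int t = 0; t < M + N; t++) {             /* one clock tick         */
            for (int j = N - 1; j >= 0; j--) {        /* every cell fires at once */
                int i = t - j;                        /* row arriving at cell j  */
                double a = (i >= 0 && i < M) ? A[i][j] : 0.0;
                partial[j + 1] = partial[j] + a * x[j];
            }
            partial[0] = 0.0;                         /* a fresh sum enters on the left */
            if (t >= N - 1 && t - (N - 1) < M)
                y[t - (N - 1)] = partial[N];          /* y[i] emerges after N ticks */
        }

        for (int i = 0; i < M; i++) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }

Each cell performs only one multiply-add per tick and communicates only with its right-hand neighbor, which is exactly the rhythmic, incremental behavior described above.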

4.6 Everything Explicit

The next category of models is those that do not hide much detail of decomposition and communication. Most of the first-generation models of parallel computation are at this level, designed for a single architecture style, explicitly managed.

4.6.1 Dynamic Structure. Most models provide a particular paradigm for handling partitioning, mapping, and communication. A few models have tried to be general enough to provide multiple paradigms, for example, Pi [Dally and Wills 1989; Wills 1990], by providing sets of primitives for each style of communication. Such models can have guaranteed performance and cost measures, but they make the task of software construction difficult because of the amount of detail that must be given about a computation. Another set of models of the same general kind are the programming languages Orca [Bal et al. 1990] and SR [Andrews and Olsson 1993; Andrews et al. 1988]. Orca is an object-based language that uses shared data-objects for interprocess communication. The Orca system is a hierarchically structured set of abstractions. At the lowest level, reliable broadcast is the basic primitive so that writes to a replicated structure can take effect rapidly throughout a system. At the next level of abstraction, shared data are encapsulated in passive objects that are replicated throughout the system. Parallelism in Orca is expressed through explicit process creation. A new process can be created through the fork statement

fork proc_name(params) [on (cpu_number)].

The on part optionally specifies the processor on which to run the child process. The parameters specify the shared data objects used for communication between the parent and the child processes.

Synchronizing Resources (SR) is based on the resource concept. A resource is a module that can contain several processes. A resource is dynamically created by the create command and its processes communicate by the use of semaphores. Processes belonging to different resources communicate using only a restricted set of operations explicitly defined in the program as procedures.

There is a much larger set of models or programming languages based on a single communication paradigm. We consider three paradigms: message passing, shared memory, and rendezvous.

Message passing is the basic communication technology provided on distributed-memory MIMD architectures, and so message-passing systems are available for all such machines. The interfaces are low-level, using sends and receives to specify the message to be exchanged, process identifier, and address.

It was quickly realized that message-passing systems look much the same for any distributed-memory architecture, so it was natural to build standard interfaces to improve the portability of message-passing programs. The most recent example of this is MPI (message passing interface) [Dongarra et al. 1995; Message Passing Interface Forum 1993], which provides a rich set of messaging primitives, including point-to-point communication, broadcasting, and the ability to collect processes in groups and communicate only within each group. MPI aims to become the standard message-passing interface for parallel applications and libraries [Dongarra et al. 1996]. Point-to-point communications are based on send and receive primitives

MPI_Send(buf, bufsize, datatype, dest, ...)

MPI_Recv(buf, bufsize, datatype, source, ...).

Moreover, MPI provides primitives for collective communication and synchronization such as MPI_Barrier, MPI_Bcast, and MPI_Gather. In its first version, MPI does not make provision for process creation, but in the MPI2 version additional features for active messages, process startup, and dynamic process creation are provided.
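A minimal complete C program using the primitives named above might look as follows; the data values and the communication pattern (rank 0 sending to everyone, followed by a broadcast and a barrier) are chosen only for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            value = 42;
            for (int dest = 1; dest < size; dest++)      /* point-to-point sends */
                MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }

        /* Collective operations: every process in the group must call them. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);

        printf("rank %d of %d received %d\n", rank, size, value);
        MPI_Finalize();
        return 0;
    }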

More architecture-independent message-passing models have been developed to allow transparent use of networks of workstations. In principle, such networks have much unused compute power to be exploited. In practice, the large latencies involved in communicating among workstations make them low-performance parallel computers. Models for workstation message-passing include systems such as PVM [Beguelin et al. 1993, 1994, 1995; Geist 1993; Sunderam 1990, 1992], Parmacs [Hempel 1991; Hempel et al. 1992], and p4 [Butler and Lusk 1992]. Such models are exactly the same as inter-multiprocessor message-passing systems, except that they typically have much larger-grain processes to help conceal the latency, and they must address heterogeneity of the processors. For example, PVM (parallel virtual machine) has gained widespread acceptance as a programming toolkit for heterogeneous distributed computing. It provides a set of primitives for process creation and communication that can be incorporated into existing procedural languages in order to implement parallel programs. In PVM, a process is created by the pvm_spawn() call. For instance, the statement

proc_num = pvm_spawn("progr1", NULL, PvmTaskDefault, 0, n_proc)

spawns n_proc copies of the program progr1. The actual number of processes started is returned to proc_num. Communication between two processes can be implemented by the primitives

pvm_send(proc_id, msg) and pvm_recv(proc_id, msg).

For group communication and synchronization the functions pvm_bcast(), pvm_mcast(), and pvm_barrier() can be used.
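A master-side sketch in C using these PVM primitives is shown below; the program name "progr1", the message tag, and the task count are illustrative, and error checking is omitted.

    #include <stdio.h>
    #include "pvm3.h"

    int main(void) {
        int tids[4], msgtag = 1, data = 42, reply;

        /* Start four copies of the worker program on any available hosts. */
        int n = pvm_spawn("progr1", NULL, PvmTaskDefault, "", 4, tids);

        /* PVM sends go through an explicit packing buffer. */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&data, 1, 1);
        pvm_send(tids[0], msgtag);           /* send to the first child   */

        pvm_recv(tids[0], msgtag);           /* wait for its answer       */
        pvm_upkint(&reply, 1, 1);
        printf("spawned %d tasks, reply = %d\n", n, reply);

        pvm_exit();
        return 0;
    }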

Using PVM and similar models, programmers must do all of the decomposition, placement, and communication explicitly. This is further complicated by the need to deal with several different operating systems to communicate this information to the messaging software. Such models may become more useful with the increasing use of optical interconnection and high-performance networks for connecting workstations.

Shared-memory communication is a natural extension of techniques used in operating systems, but multiprogramming is replaced by true multiprocessing. Models for this paradigm are therefore well understood. Some aspects change in the parallel setting. On a single processor it is never sensible to busy-wait for a message, since this denies the processor to other processes; it might be the best strategy on a parallel computer since it avoids the overhead of two context switches. Shared-memory parallel computers typically provide communication using standard paradigms such as shared variables and semaphores. This model of computation is an attractive one since issues of decomposition and mapping are not important. However, it is closely linked to a single style of architecture, so that shared-memory programs are not portable.
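A small shared-memory example in C with POSIX threads shows the style: a shared counter protected by a lock, with no explicit decomposition or mapping. The thread count and the work performed are arbitrary choices for illustration.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    static long counter = 0;                        /* shared variable      */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < NITER; i++) {
            pthread_mutex_lock(&lock);              /* short critical section */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);         /* NTHREADS * NITER     */
        return 0;
    }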

An important shared-memory programming language is Java [Lea 1996], which has become popular because of its connection with platform-independent software delivery on the Web. Java is thread-based, and allows threads to communicate and synchronize using condition variables. Such shared variables are accessed from within synchronized methods. A critical section enclosing the text of the methods is automatically generated. These critical sections are rather misleadingly called monitors. However, notify and wait operations must be explicitly invoked within such sections, rather than being automatically associated with entry and exit. There are many other thread packages available providing lightweight processes with shared-memory communication [Buhr and Stroobosscher 1990; Buhr et al. 1991; Faust and Levy 1990; Mukherjee et al. 1994].

Rendezvous-based programming models are distributed-memory paradigms using a particular cooperation mechanism. In the rendezvous communication model, an interaction between two processes A and B takes place when A calls an entry of B and B executes an accept for that entry. An entry call is similar to a procedure call and an accept statement for the entry contains a list of statements to be executed when the entry is called. The best known parallel programming languages based on rendezvous cooperation are Ada [Mundie and Fisher 1986] and Concurrent C [Gehani and Roome 1986]. Ada was designed on behalf of the US Department of Defense mainly to program real-time applications both on sequential and parallel distributed computers. Parallelism in the Ada language is based on processes called tasks. A task can be created explicitly or can be statically declared. In this latter case, a task is activated when the block containing its declaration is entered. Tasks are composed of a specification part and a body. As discussed before, this mechanism is based on entry declarations, entry calls, and accept statements. Entry declarations are allowed only in the specification part of a task. Accept statements for the entries appear in the body of a task. For example, the following accept statement executes the operation when the entry SQUARE is called.

accept SQUARE (X: INTEGER; Y: out INTEGER) do
   Y := X * X;
end;

Other important features of Ada for parallel programming are the use of the select statement, which is similar to the CSP ALT command for expressing nondeterminism, and the exception-handling mechanism for dealing with software failures. On the other hand, Ada does not address the problem of mapping tasks onto multiple processors and does not provide conditions to be associated with the entry declarations, except for some special cases such as protected objects. Recent surveys of such models can be found in Bal et al. [1989] and Gottlieb et al. [1983a,b].

4.6.2 Static Structure. Most low-level models allow dynamic process creation and communication. An exception is Occam [Jones and Goldsmith 1988], in which the process structure is fixed and communication takes place along synchronous channels. Occam programs are constructed from a small number of primitive constructs: assignment, input (?), and output (!). To design complex parallel processes, primitive constructs can be combined using the parallel constructor

PAR
  Proc1
  Proc2

The two processes are executed in parallel and the PAR constructor terminates only after all of its components have terminated. An alternative constructor (ALT) implements nondeterminism. It waits for input from a number of channels and then executes the corresponding component process. For example, the following code

ALT
  request ? data
    DataProc
  exec ? oper
    ExecProc

waits to get a data request or an operation request. The process corresponding to the selected guard is executed.

Occam has a strong semantic foundation in CSP [Hoare 1985], so that software development by transformation is possible. However, it is so low-level that this development process is only practical for small or critical applications.

4.7 PRAM

A final model that must be considered is the PRAM model [Karp and Ramachandran 1990], which is the basic model for much theoretical analysis of parallel computation. The PRAM abstract machine consists of a set of processors, capable of executing independent programs but doing so synchronously, connected to a shared memory. All processors can access any location in unit time, but they are forbidden to access the same location on the same step.

The PRAM model requires detailed descriptions of computations, giving the code for each processor and ensuring that memory conflict is avoided. The unit-time memory-access part of the cost model cannot be satisfied by any scalable real machine, so the cost measures of the PRAM model are not accurate. Nor can they be made accurate in any uniform way, because the real cost of accessing memory for an algorithm depends on the total number of accesses and the pattern in which they occur. One attempt to provide some abstraction from the PRAM is the language FORK [Kessler and Seidl 1995].
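As a small worked illustration of why the unit-cost assumption matters (our example, not taken from any particular source), consider summing n = 2^k values on an EREW PRAM with n/2 processors by pairwise combination:

    T_PRAM(n) = log n                       (tree levels, each step charged one time unit)
    W(n)      = n/2 + n/4 + ... + 1 = n - 1 (total operations)
    T_real(n) >= g(p) * log n               (if each shared-memory access really costs g(p))

Because the access cost g(p) grows with the size of any scalable machine, the measured time diverges from the PRAM prediction by a machine-dependent factor, and by a further factor when the accesses within a level contend for the memory system.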

A good overview of models aimed at particular architectures can be found in McColl [1993b].

5. SUMMARY

We have presented an overview of parallel programming models and languages, using a set of six criteria that an ideal model should satisfy. Four of the criteria relate to the need to use the model as a target for software development. They are: ease of programming and the existence of a methodology for constructing software that handles issues such as correctness, independence from particular architectures, and simplicity and abstractness. The remaining criteria address the need for execution of the model on real parallel machines: guaranteed performance and the existence of costs that can be inferred from the program. Together these ensure predictable performance for programs.

We have assessed models by how well they satisfy these criteria, dividing them into six classes, ranging from the most abstract, which generally satisfy software development criteria but not predictable performance criteria, to very concrete models, which provide predictable performance but make it hard to construct software.

The models we have described represent an extremely wide variety of approaches at many different levels. Overall, some interesting trends are visible.

—Work on low-level models, in which the description of computations is completely explicit, has diminished significantly. We regard this as a good thing, since it shows that an awareness of the importance of abstraction has spread beyond the research community.

—There is a concentration on models in the middle range of abstraction, with a great deal of ingenuity being applied to concealing aspects of parallel computations while struggling to retain maximum expressiveness. This is also a good thing, since tradeoffs among expressiveness, software development complexity, and run-time efficiency are subtle. Presumably a blend of theoretical analysis and practical experimentation is the most likely road to success, and this strategy is being applied.

—There are some very abstract models that also provide predictable and useful performance on a range of parallel architectures. Their existence raises the hope that models satisfying all of the properties with which we began can eventually be constructed.

These trends show that parallel programming models are leaving low-level approaches and moving towards more abstract approaches, in which languages and tools simplify the task of designers and programmers. At the same time these trends provide for more robust parallel software with predictable performance.

This scenario brings many benefits for parallel software development. Models, languages, and tools represent an intermediate level between users and parallel architectures and allow the simple and effective utilization of parallel computation in many application areas. The availability of models and languages that abstract from architecture complexity has a significant impact on the parallel software development process and therefore on the widespread use of parallel computing systems.

Thus we can hope that, within a few years, there will be models that are easy to program, providing at least moderate abstraction, that can be used with a wide range of parallel computers, making portability a standard feature of parallel programming, that are easy to understand, and that can be executed with predictably good performance. It will take longer for software development methods to come into general use, but that should be no surprise because we are still struggling with software development for sequential programming. Computing costs for programs is possible for any model with predictable performance, but integrating such costs into software development in a useful way is much more difficult.

ACKNOWLEDGMENTS

We are grateful to Dave Dove and Luigi Palopoli for their comments on a draft of this article and to the anonymous referees for their helpful comments.

REFERENCES

AGHA, G. 1986. Actors: A Model of ConcurrentComputation in Distributed Systems. MITPress, Cambridge, MA.

AGHA, G. AND CALLSEN, C. J. 1993. ActorSpace:An open distributed programming paradigm.In Proceedings of the Fourth ACM SIGPLANSymposium on Principles and Practice of Par-allel Programming (May), 23–32.

AHMED, S., CARRIERO, N., AND GELERNTER, D. 1994. A program building tool for parallel applications. In DIMACS Workshop on Specification of Parallel Algorithms (Princeton University, May), 161–178.

AHUJA, S., CARRIERO, N., GELERNTER, D., AND KRISHNASWAMY, V. 1988. Matching languages and hardware for parallel computation in the Linda machine. IEEE Trans. Comput. 37, 8 (August), 921–929.

AMERICA, P. 1987. POOL-T: A parallel object-oriented language. In Object-Oriented Concur-rent Programming, A. Yonezawa et al., Eds.,MIT Press, Cambridge, MA, 199–220.

ANDRE, F. AND THOMAS, H. 1989. The PandoreSystem. In IFIP Working Conference on De-centralized Systems (Dec.) poster presenta-tion.

ANDRE, F., MAHEO, Y., LE FUR, M., AND PAZAT,J.-L. 1994. The Pandore compiler: Over-view and experimental results. Tech. Rep.PI-869, IRISA, October.

ANDREWS, G. R. AND OLSSON, R. A. 1993. TheSR Programming Language. Benjamin/Cum-mings, Redwood City, CA.

ANDREWS, G. R., OLSSON, R. A., COFFIN, M. A.,ELSHOFF, I., NILSEN, K., PURDIN, T., ANDTOWNSEND, G. 1988. An overview of the SRlanguage and implementation. ACM Trans.Program. Lang. Syst. 10, 1 (Jan.), 51–86.

ARBEIT, B., CAMPBELL, G., COPPINGER, R., PEIERLS,R., AND STAMPF, D. 1993. The ALMS sys-tem: Implementation and performance.Brookhaven National Laboratory, internal re-port.

BACK, R. J. R. 1989a. A method for refiningatomicity in parallel algorithms. In PARLE89Parallel Architectures and Languages Europe(June) LNCS 366, Springer-Verlag, 199–216.

BACK, R. J. R. 1989b. Refinement calculus partII: Parallel and reactive programs. Tech. Rep.93, Åbo Akademi, Departments of ComputerScience and Mathematics, SF-20500 Åbo, Fin-land.

BACK, R. J. R. AND SERE, K. 1989. Stepwise re-finement of action systems. In Mathematics ofProgram Construction, LNCS 375, June,Springer-Verlag, 115–138.

BACK, R. J. R. AND SERE, K. 1990. Deriving anOccam implementation of action systems.Tech. Rep. 99, Åbo Akademi, Departments ofComputer Science and Mathematics,SF-20500 Åbo, Finland.

BAIARDI, F., DANELUTTO, M., JAZAYERI, M., PEL-AGATTI, S., AND VANNESCHI, M. 1991. Archi-tectural models and design methodologies forgeneral-purpose highly-parallel computers. InIEEE CompEuro 91—Advanced ComputerTechnology, Reliable Systems and Applica-tions (May).

BAL, H. E., STEINER, J. G., AND TANENBAUM, A. S. 1989. Programming languages for distributed computing systems. Comput. Surv. 21, 3 (Sept.), 261–322.

BAL, H. E., TANENBAUM, A. S., AND KAASHOEK,M. F. 1990. Orca: A language for distrib-uted processing. ACM SIGPLAN Not. 25 5(May), 17–24.

BANATRE, J. P. AND LE METAYER, D. 1991. In-troduction to Gamma. In Research Directionsin High-Level Parallel Programming Lan-guages, J. P. Banatre and D. Le Metayer,Eds., LNCS 574, June, Springer-Verlag, 197–202.

BANGER, C. 1992. Arrays with categorical typeconstructors. In ATABLE’92 Proceedings of aWorkshop on Arrays (June), 105–121.

BANGER, C. R. 1995. Construction of multidi-mensional arrays as categorical data types.Ph.D. Thesis, Queen’s University, Kingston,Canada.

BAUDE, F. 1991. Utilisation du paradigme ac-teur pour le calcul parallele. Ph.D. thesis,Universite de Paris-Sud.

BAUDE, F. AND VIDAL-NAQUET, G. 1991. Actorsas a parallel programming model. In Proceed-ings of Eighth Symposium on Theoretical As-pects of Computer Science. LNCS 480, Spring-er-Verlag.

BEGUELIN, A., DONGARRA, J., GEIST, A., MANCHEK,R., MOORE, K., AND SUNDERAM, V. 1993.PVM and HeNCE: Tools for heterogeneousnetwork computing. In Software for ParallelComputation, NATO ASI Series F., Vol. 106J. S. Kowalik and L. Grandinetti, Eds.Springer-Verlag, 91–99.

BEGUELIN, A., DONGARRA, J., GEIST, A., MANCHEK,R., AND SUNDERAM, V. 1995. Recent en-hancements to PVM. Int. J. Supercomput.Appl. High Perform. Comput.

BEGUELIN, A., DONGARRA, J. J., GEIST, G. A.,MANCHEK, R., AND SUNDERAM, V. S. PVMsoftware system and documentation. Email [email protected].

BEGUELIN, A., DONGARRA, J., GEIST, A., MANCHEK,R., OTTO, S., AND WALPOLE, J. 1994. PVM:Experiences, current status and future direc-tion. Tech. Rep. CS/E 94-015, Oregon Gradu-ate Institute CS.

BERSHAD, B. N., LAZOWSKA, E., AND LEVY, H.1988. Presto: A system for object-orientedparallel programming. Softw. Pract. Exper.(August).

BICKFORD, M. 1994. Composable specificationsfor asynchronous systems using UNITY. InProceedings International Symposium on Ad-vanced Research in Asynchronous Circuitsand Systems (Nov.), 216–227.

BILARDI, G., HERLEY, K. T., PIETRACAPRINA, A.,PUCCI, G., AND SPIRAKIS, P. 1996. BSP vs.LogP. In Proceedings of the Eighth AnnualSymposium on Parallel Algorithms and Archi-tectures (June), 25–32.


BIRD, R. S. 1987. An introduction to the theoryof lists. In Logic of Programming and Calculiof Discrete Design, M. Broy, Ed., Springer-Verlag, 3–42.

BLACK, A., HUTCHINSON, N., JUL, E., LEVY, H., ANDCARTER, L. 1987. Distribution and abstracttypes in Emerald. IEEE Trans. Softw. Eng.13, 1 (Jan.), 65–76.

BLACK, A. ET AL. 1984. EPL programmers’guide. University of Washington, Seattle,June.

BLELLOCH, G. 1987. Scans as primitive paralleloperations. In Proceedings of the Interna-tional Conference on Parallel Processing (Au-gust), 355–362.

BLELLOCH, G. E. 1990. Vector Models for Data-Parallel Computing. MIT Press, Cambridge,MA.

BLELLOCH, G. E. 1993. NESL: A nested data-parallel language. Tech. Rep. CMU-CS-93-129, Carnegie Mellon University, April.

BLELLOCH, G. E. 1996. Programming parallelalgorithms. Commun. ACM 37, 3 (March).

BLELLOCH, G. AND GREINER, J. 1994. A parallelcomplexity model for functional languages.Tech. Rep. CMU-CS-94-196, School of Com-puter Science, Carnegie Mellon University,October.

BLELLOCH, G. E. AND SABOT, G. W. 1988. Com-piling collection-oriented languages onto mas-sively parallel computers. In Proceedings ofthe Second Symposium on the Frontiers ofMassively Parallel Computation, 575–585.

BLELLOCH, G. E. AND SABOT, G. W. 1990. Com-piling collection-oriented languages onto mas-sively parallel computers. J. Parallel Distrib.Comput., 119–134.

BRIAT, J., FAVRE, M., GEYER, C., AND CHASSIN DEKERGOMMEAUX, J. 1991. Scheduling of OR-parallel Prolog on a scalable reconfigurabledistributed memory multiprocessor. In Pro-ceedings of PARLE 91, Springer Lecture Notesin Computer Science 506, Springer-Verlag,385–402.

BUHR, P. A. AND STROOBOSSCHER, R. A. 1990.The mSystem: Providing light-weight concur-rency on shared-memory multiprocessor com-puters running UNIX. Softw. Pract. Exper. 20,9 (Sept.), 929–963.

BUHR, P. A., MACDONALD, H. I., AND STROOBOSS-CHER, R. A. 1991. mSystem reference man-ual, version 4.3.3. Tech. Rep., Department ofComputer Science, University of Waterloo,Waterloo, Ontario, Canada, N2L 3G1, March.

BURKHART, H., FALCO KORN, C., GUTZWILLER, S.,OHNACKER, P., AND WASER, S. 1993. BACS:Basel algorithm classification scheme. Tech.Rep. 93-3, Institut fur Informatik der Univer-sitat Basel, March.

BUTLER, R. AND LUSK, E. 1992. User’s guide to the p4 programming system. Tech. Rep. ANL-92/17, Argonne National Laboratory, Mathematics and Computer Science Division, October.

CANNATARO, M., DI GREGORIO, S., RONGO, R., SPA-TARO, W., SPEZZANO, G., AND TALIA, D. 1995.A parallel cellular automata environment onmulticomputers for computational science.Parallel Comput. 21, 5, 803–824.

CANNATARO, M., SPEZZANO, G., AND TALIA, D.1991. A parallel logic system on a multicom-puter architecture. In Fut. Gen. Comput. Syst.6, 317–331.

CARRIERO, N. 1987. Implementation of tuplespace machines. Tech. Rep. YALEU/DCS/RR-567, Department of Computer Science, YaleUniversity, December.

CARRIERO, N. AND GELERNTER, D. 1988. Appli-cation experience with Linda. In ACM/SIG-PLAN Symposium on Parallel Programming(July), Vol. 23, 173–187.

CARRIERO, N. AND GELERNTER, D. 1993. Learn-ing from our success. In Software for ParallelComputation, NATO ASI Series F, Vol. 106J. S. Kowalik and L. Grandinetti, Eds.Springer-Verlag, 37–45.

CHANDY, K. M. AND MISRA, J. 1988. ParallelProgram Design: A Foundation. Addison-Wes-ley, Reading, MA.

CHASSIN DE KERGOMMEAUX, J. AND CODOGNET, P.1994. Parallel logic programming systems.ACM Comput. Surv. 26, 3, 295–336.

CHEN, M., CHOO, Y.-I., AND LI, J. 1991. Crystal:Theory and pragmatics of generating efficientparallel code. In Parallel Functional Lan-guages and Compilers, B. K. Szymanski, Ed.,ACM Press Frontier Series, New York, 255–308.

CHIEN, A. A. 1991. Concurrent aggregates: Us-ing multiple-access data abstractions to man-age complexity in concurrent programs.OOPS Messenger 2, 2 (April), 31–36.

CHIEN, A. A. AND DALLY, W. J. 1990. Con-current Aggregates. In Second SIGPLANSymposium on Principles and Practice of Par-allel Programming (Feb.), 187–196.

CHIN, A. AND MCCOLL, W. F. 1994. Virtualshared memory: Algorithms and complexity.Inf. Comput. 113, 2 (Sept.), 199–219.

COLE, M. 1989. Algorithmic Skeletons: Struc-tured Management of Parallel Computation.Research Monographs in Parallel and Distrib-uted Computing, Pitman, London.

COLE, M. 1990. Towards fully local multicom-puter implementations of functional pro-grams. Tech. Rep. CS90/R7, Department ofComputing Science, University of Glasgow,January.

COLE, M. 1992. Writing parallel programs in non-parallel languages. In Software for Parallel Computers: Exploiting Parallelism through Software Environments, Tools, Algorithms and Application Libraries, R. Perrot, Ed., Chapman and Hall, London, 315–325.

COLE, M. 1994. Parallel programming, list ho-momorphisms and the maximum segmentsum problem. In Parallel Computing: Trendsand Applications, G. R. Joubert, Ed., North-Holland, 489–492.

CONERY, J. S. 1987. Parallel Execution of LogicPrograms. Kluwer Academic.

COOPER, R. AND HAMILTON, K. G. 1988. Pre-serving abstraction in concurrent program-ming. IEEE Trans. Softw. Eng. SE-14, 2,258–263.

CREVEUIL, C. 1991. Implementation of Gammaon the Connection Machine. In Research Di-rections in High-Level Parallel ProgrammingLanguages, J. P. Banatre and D. Le Metayer,Eds., LNCS 574, June, Springer-Verlag, 219–230.

CULLER, D., KARP, R., PATTERSON, D., SAHAY, A.,SCHAUSER, K. E., SANTOS, E., SUBRAMONIAN,R., AND VON EICKEN, T. 1993. LogP: To-ward a realistic model of parallel computa-tion. In ACM SIGPLAN Symposium on Prin-ciples and Practice of Parallel Programming(May).

DALLY, W. J. AND WILLS, D. S. 1989. Universalmechanisms for concurrency. In PARLE ’89,Parallel Architectures and Languages Europe(June), LNCS 365, Springer-Verlag, 19–33.

DALLY, W. J., FISKE, J. A. S., KEEN, J. S., LETHIN,R. A., NOAKES, M. D., NUTH, P. R., DAVISON,R. E., AND FYLER, G. A. 1992. The message-driven processor. IEEE Micro (April), 23–39.

DANELUTTO, M., DI MEGLIO, R., ORLANDO, S., PEL-AGATTI, S., AND VANNESCHI, M. 1991a. Amethodology for the development and the sup-port of massively parallel programs. Fut. Gen.Comput. Syst. Also appears as “The P3L lan-guage: An introduction,” Hewlett-PackardRep. HPL-PSC-91-29, December.

DANELUTTO, M., DI MEGLIO, R., PELAGATTI, S., AND VANNESCHI, M. 1990. High level language constructs for massively parallel computing. Tech. Rep., Hewlett-Packard Pisa Science Center, HPL-PSC-90-19.

DANELUTTO, M., PELAGATTI, S., AND VANNESCHI,M. 1991b. High level languages for easymassively parallel computing. Tech. Rep.,Hewlett-Packard Pisa Science Center, HPL-PSC-91-16.

DARLINGTON, J., CRIPPS, M., FIELD, T., HARRISON,P. G., AND REEVE, M. J. 1987. The designand implementation of ALICE: A parallelgraph reduction machine. In Selected Reprintson Dataflow and Reduction Architectures,S. S. Thakkar, Ed., IEEE Computer SocietyPress, Los Alamitos, CA.

DARLINGTON, J., FIELD, A. J., HARRISON, P. G., KELLY, P. H. J., WU, Q., AND WHILE, R. L. 1993. Parallel programming using skeleton functions. In PARLE93, Parallel Architectures and Languages Europe (June).

DARLINGTON, J., GUO, Y., AND YANG, J. 1995.Parallel Fortran family and a new perspec-tive. In Massively Parallel ProgrammingModels (Berlin, Oct.), IEEE Computer SocietyPress, Los Alamitos, CA.

DE DINECHIN, F., QUINTON, P., AND RISSET, T.1995. Structuration of the ALPHA language.In Massively Parallel Programming Models(Berlin, Oct.), IEEE Computer Society Press,Los Alamitos, CA, 18–24.

DONGARRA, J., OTTO, S. W., SNIR, M., AND WALKER,D. 1995. An introduction to the MPI stan-dard. Tech. Rep. CS-95-274, University ofTennessee. Available at http://www.netlib.org/tennessee/ut-cs-95-274.ps, January.

DONGARRA, J. J., OTTO, S. W., SNIR, M., AND WALKER, D. 1996. A message passing standard for MPP and workstations. Commun. ACM 39, 7, 84–90.

DOUGLAS, A., ROWSTRON, A., AND WOOD, A.1995. ISETL-Linda: Parallel programmingwith bags. Tech. Rep. YCS 257, Department ofComputer Science, University of York, UK,September.

ECKART, J. D. 1992. Cellang 2.0: Referencemanual. SIGPLAN Not. 27, 8, 107–112.

EISENBACH, S. AND PATTERSON, R. 1993. p-calcu-lus semantics for the concurrent configurationlanguage Darwin. In Proceedings of 26th An-nual Hawaii International Conference on Sys-tem Science (Jan.) Vol. II, IEEE ComputerSociety Press, Los Alamitos, CA.

EKANADHAM, K. 1991. A perspective on Id. InParallel Functional Languages and Compil-ers, B. K. Szymanski, Ed., ACM Press, NewYork, 197–254.

FAGIN, B. S. AND DESPAIN, A. M. 1990. The per-formance of parallel Prolog programs. IEEETrans. Comput. C-39, 12, 1434–1445.

FAIGLE, C., FURMANSKI, W., HAUPT, T., NIEMIC, J.,PODGORNY, M., AND SIMONI, D. 1993.MOVIE model for open systems based highperformance distributed computing. Concur-rency Pract. Exper. 5, 4 (June), 287–308.

FAUST, J. E. AND LEVY, H. M. 1990. The perfor-mance of an object-oriented threads package.In OOPSLA ECOOP ’90 Conference on Object-Oriented Programming: Systems, Languages,and Applications, (Ottawa, Canada, Oct.),278–288, Published in SIGPLAN Not. 25, 10(Oct.).

FELDCAMP, D. AND WAGNER, A. 1993. Parsec: Asoftware development environment for perfor-mance oriented parallel programming. InTransputer Research and Applications 6. S.Atkins and A. Wagner, Eds., May, IOS Press,Amsterdam, 247–262.


FLYNN-HUMMEL, S. AND KELLY, R. 1993. A ratio-nale for parallel programming with sets. J.Program. Lang. 1, 187–207.

FOSTER, I. AND TAYLOR, S. 1990. Strand: NewConcepts in Parallel Programming. Prentice-Hall, Englewood Cliffs, NJ.

FOSTER, I. AND TUECKE, S. 1991. Parallel pro-gramming with PCN. Tech. Rep. ANL-91/32,Rev. 1, Mathematics and Computer ScienceDivision, Argonne National Laboratory, De-cember.

FOSTER, I., OLSON, R., AND TUECKE, S. 1992.Productive parallel programming: The PCNapproach. Sci. Program. 1, 1, 51–66.

FOSTER, I., TUECKE, S., AND TAYLOR, S. 1991. Aportable run-time system for PCN. Tech. Rep.ALMS/MCS-TM-137. Argonne National Labo-ratory and CalTech, December. Available atftp://info.mcs.anl.gov/pub/tech reports/.

GEHANI, H. H. AND ROOME, W. D. 1986. Con-current C. Softw. Pract. Exper. 16, 9, 821–844.

GEIST, G. A. 1993. PVM3: Beyond network com-puting. In Parallel Computation, J. Volkert,Ed., LNCS 734, Springer-Verlag, 194–203.

GERTH, R. AND PNUELI, A. 1989. RootingUNITY. In Proceedings of the Fifth Interna-tional Workshop on Software Specificationand Design (Pittsburgh, PA, May).

GIBBONS, J. 1991. Algebras for tree algorithms.D. Phil. Thesis, Programming ResearchGroup, University of Oxford.

GOGUEN, J. AND WINKLER, T. 1988. IntroducingOBJ3. Tech. Rep. SRI-CSL-88-9, SRI Interna-tional, Menlo Park, CA, August.

GOGUEN, J., KIRCHNER, C., KIRCHNER, H., MEGRE-LIS, A., MESEGUER, J., AND WINKLER, T.1994. An introduction to OBJ3. In LectureNotes in Computer Science, Vol. 308, Spring-er-Verlag, 258–263.

GOGUEN, J., WINKLER, T., MESEGUER, J., FUTAT-SUGI, K., AND JOUANNAUD, J.-P. 1993. In-troducing OBJ. In Applications of AlgebraicSpecification using OBJ. J. Gouguen, Ed.,Cambridge.

GOLDMAN, K. J. 1989. Paralation views: Ab-stractions for efficient scientific computing onthe Connection Machine. Tech. Rep. MIT/LCS/TM398, MIT Laboratory for ComputerScience.

GOTTLIEB, A. ET AL. 1983a. The NYU Ultracom-puter—designing an MIMD shared memoryparallel computer. IEEE Trans. Comput. C-32(Feb.), 175–189.

GOTTLIEB, A., LUBACHEVSKY, B., AND RUDOLPH,L. 1983b. Basic techniques for the efficientcoordination of large numbers of cooperatingsequential processes. ACM Trans. Program.Lang. Syst. 5, 2 (April), 164–189.

GREGORY, S. 1987. Parallel Logic Programmingin PARLOG. Addison-Wesley, Reading, MA.

GRIMSHAW, A. S. 1991. An introduction to par-allel object-oriented programming with Men-tat. Tech. Rep. 91-07, Computer Science De-partment, University of Virginia, April.

GRIMSHAW, A. 1993a. Easy-to-use object-ori-ented parallel processing with Mentat. IEEEComput. 26, 5 (May), 39–51.

GRIMSHAW, A. S. 1993b. The Mentat computa-tion model: Data-driven support for object-oriented parallel processing. Tech. Rep. 93-30,Computer Science Department, University ofVirginia, May.

GRIMSHAW, A. S., LOYOT, E. C., AND WEISSMAN,J. B. 1991a. Mentat programming lan-guage. MPL Reference Manual 91-32, Univer-sity of Virginia, 3 November.

GRIMSHAW, A. S., LOYOT, E. C., SMOOT, S., AND WEISSMAN, J. B. 1991b. Mentat user’s manual. Tech. Rep. 91-31, University of Virginia, 3 November 1991.

HAINS, G. AND FOISY, C. 1993. The data-parallelcategorical abstract machine. In PARLE93,Parallel Architectures and Languages Europe(June), LNCS 694, Springer-Verlag.

HANSEN, P. B. 1978. Distributed processes: Aconcurrent programming concept. Commun.ACM 21, 11, 934–940.

HATCHER, P. J. AND QUINN, M. J. 1991. Data-Parallel Programming on MIMD Computers. MIT Press, Cambridge, MA.

HEINZ, E. A. 1993. Modula-3*: An efficiently compilable extension of Modula-3 for problem-oriented explicitly parallel programming. In Proceedings of the Joint Symposium on Parallel Processing 1993 (Waseda University, Tokyo, May), 269–276.

HEMPEL, R. 1991. The ANL/GMD macros (PARMACS) in FORTRAN for portable parallel programming using the message passing programming model—users’ guide and reference manual. Tech. Rep., GMD, Postfach 1316, D-5205 Sankt Augustin 1, Germany, November.

HEMPEL, R., HOPPE, H.-C., AND SUPALOV, A. 1992. PARMACS-6.0 library interface specification. Tech. Rep., GMD, Postfach 1316, D-5205 Sankt Augustin 1, Germany, December.

HERATH, J., YUBA, T., AND SAITO, N. 1987. Dataflow computing. In Parallel Algorithms and Architectures (May), LNCS 269, 25–36.

HIGH PERFORMANCE FORTRAN LANGUAGE SPECIFICATION 1993. Available by ftp from titan.rice.cs.edu, January.

HOARE, C. A. R. 1985. Communicating Sequential Processes. International Series in Computer Science, Prentice-Hall, Englewood Cliffs, NJ.

HUDAK, P. 1986. Para-functional programming. IEEE Comput. 19, 8, 60–70.

HUDAK, P. AND FASEL, J. 1992. A gentle introduction to Haskell. SIGPLAN Not. 27, 5 (May).

HUMMEL, R., KELLY, R., AND FLYNN-HUMMEL, S. 1991. A set-based language for prototyping parallel algorithms. In Proceedings of the Computer Architecture for Machine Perception ’91 Conference (Dec.).

JAFFAR, J. AND LASSEZ, J.-L. 1987. Constraint logic programming. In ACM Symposium on Principles of Programming Languages, ACM Press, New York, 111–119.

JAY, C. B. 1995. A semantics for shape. Sci. Comput. Program. 25 (Dec.), 251–283.

JERID, L., ANDRE, F., CHERON, O., PAZAT, J. L., AND ERNST, T. 1994. HPF to C-PANDORE translator. Tech. Rep. 824, IRISA: Institut de Recherche en Informatique et Systemes Aleatoires, May.

JONES, G. AND GOLDSMITH, M. 1988. Programming in Occam2. Prentice-Hall, Englewood Cliffs, NJ.

KALE, L. V. 1987. The REDUCE-OR process model for parallel evaluation of logic programs. In Proceedings of the Fourth International Conference on Logic Programming (Melbourne, Australia), 616–632.

KARP, R. M. AND RAMACHANDRAN, V. 1990. Parallel algorithms for shared-memory machines. In Handbook of Theoretical Computer Science, Vol. A, J. van Leeuwen, Ed., Elsevier Science Publishers and MIT Press.

KELLY, P. 1989. Functional Programming for Loosely-Coupled Multiprocessors. Pitman, London.

KEßLER, C. W. AND SEIDL, H. 1995. Integrating synchronous and asynchronous paradigms: The Fork95 parallel programming language. In Massively Parallel Programming Models (Berlin, Oct.), IEEE Computer Society Press, Los Alamitos, CA.

KILIAN, M. F. 1992a. Can O-O aid massively parallel programming? In Proceedings of the Dartmouth Institute for Advanced Graduate Study in Parallel Computation Symposium, D. B. Johnson, F. Makedon, and P. Metaxas, Eds. (June), 246–256.

KILIAN, M. F. 1992b. Parallel sets: An object-oriented methodology for massively parallel programming. Ph.D. Thesis, Harvard University, 1992.

KIM, W.-Y. AND AGHA, G. 1995. Efficient support of location transparency in concurrent object-oriented programming languages. In Supercomputing ’95.

KNUTH, D. E. 1976. Big omicron and big omega and big theta. SIGACT News 8, 2, 18–23.

KUNG, H. T. 1982. Why systolic architectures? IEEE Comput. 15, 1, 37–46.

LARUS, J. R., RICHARDS, B., AND VISWANATHAN, G. 1992. C**: A large-grain, object-oriented, data-parallel programming language. Tech. Rep. TR1126, University of Wisconsin-Madison, November.

LEA, D. 1996. Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, Reading, MA.

LECHNER, U., LENGAUER, C., AND WIRSING, M. 1995. An object-oriented airport: Specification and refinement in Maude. In Recent Trends in Data Type Specifications, E. Astesiano, G. Reggio, and A. Tarlecki, Eds., LNCS 906, Springer-Verlag, Berlin, 351–367.

LI, K. AND HUDAK, P. 1989. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov.), 321–359.

LINCOLN, P., MARTÍ-OLIET, N., AND MESEGUER, J. 1994. Specification, transformation, and programming of concurrent systems in rewriting logic. Tech. Rep. SRI-CSL-94-11, SRI, May.

LISKOV, B. 1987. Implementation of Argus. In Proceedings of the Eleventh Symposium on Operating Systems Principles, ACM Press, New York, 111–122.

LLOYD, J. W. 1984. Foundations of Logic Programming. Springer-Verlag.

LOBE, G., LU, P., MELAX, S., PARSONS, I., SCHAEFFER, J., SMITH, C., AND SZAFRON, D. 1992. The Enterprise model for developing distributed applications. Tech. Rep. 92-20, Department of Computing Science, University of Alberta, November.

MALCOLM, G. 1990. Algebraic data types and program transformation. Ph.D. Thesis, Rijksuniversiteit Groningen, September.

MCCOLL, W. F. 1993a. General purpose parallel computing. In Lectures on Parallel Computation, A. M. Gibbons and P. Spirakis, Eds., Cambridge International Series on Parallel Computation, Cambridge University Press, Cambridge, 337–391.

MCCOLL, W. F. 1993b. Special purpose parallel computing. In Lectures on Parallel Computation, A. M. Gibbons and P. Spirakis, Eds., Cambridge International Series on Parallel Computation, Cambridge University Press, Cambridge, 261–336.

MCCOLL, W. F. 1994a. Bulk synchronous parallel computing. In Second Workshop on Abstract Models for Parallel Computation, Oxford University Press, New York, 41–63.

MCCOLL, W. F. 1994b. An architecture independent programming model for scalable parallel computing. In Portability and Performance for Parallel Processors, J. Ferrante and A. J. G. Hey, Eds., Wiley, New York.

MCGRAW, J., SKEDZIELEWSKI, S., ALLAN, S., OLDEHOEFT, R., GLAUERT, J., KIRKHAM, C., NOYCE, B., AND THOMAS, R. 1985. Sisal: Streams and iteration in a single assignment language: Reference manual 1.2. Tech. Rep. M-146, Rev. 1, Lawrence Livermore National Laboratory, March.

MCGRAW, J. R. 1993. Parallel functional programming in Sisal: Fictions, facts, and future. In Advanced Workshop, Programming Tools for Parallel Machines (June). Also available as Lawrence Livermore National Laboratory Tech. Rep. UCRL-JC-114360.

MEHROTRA, P. AND HAINES, M. 1994. An overview of the Opus language and runtime system. Tech. Rep. 94-39, NASA ICASE, May.

MENTAT 1994. Mentat tutorial. Available at ftp://uvacs.cs.virginia.edu/pub/mentat/tutorial.ps.Z.

MESEGUER, J. 1992. A logical theory of concurrent objects and its realization in the Maude language. Tech. Rep. SRI-CSL-92-08, SRI International, July.

MESEGUER, J. AND WINKLER, T. 1991. Parallel programming in Maude. In Research Directions in High-Level Parallel Programming Languages, J. P. Banatre and D. Le Metayer, Eds., LNCS 574, June, Springer-Verlag, 253–293.

MESSAGE PASSING INTERFACE FORUM 1993. MPI: A message passing interface. In Proceedings of Supercomputing ’93, IEEE Computer Society, Washington, DC, 878–883.

MORE, T. 1986. On the development of array theory. Tech. Rep., IBM Cambridge Scientific Center.

MUKHERJEE, B., EISENHAUER, G., AND GHOSH, K. 1994. A machine independent interface for lightweight threads. ACM Oper. Syst. Rev. 28, 1 (Jan.), 33–47.

MULLIN, L. M. R. 1988. A mathematics of arrays. Ph.D. Dissertation, Syracuse University, Syracuse, NY, December.

MUNDIE, D. A. AND FISHER, D. A. 1986. Parallel processing in Ada. IEEE Comput. 19, 8, 20–25.

MUSSAT, L. 1991. Parallel programming with bags. In Research Directions in High-Level Parallel Programming Languages, J. P. Banatre and D. Le Metayer, Eds., LNCS 574, June, Springer-Verlag, 203–218.

NEWTON, P. AND BROWNE, J. C. 1992. The CODE 2.0 graphical parallel programming language. In Proceedings of the ACM International Conference on Supercomputing (July).

NOAKES, M. O. AND DALLY, W. J. 1990. System design of the J-Machine. In Proceedings of the Sixth MIT Conference on Advanced Research in VLSI, MIT Press, Cambridge, MA, 179–194.

PEIERLS, R. AND CAMPBELL, G. 1991. ALMS—programming tools for coupling application codes in a network environment. In Proceedings of the Heterogeneous Network-Based Concurrent Computing Workshop (Tallahassee, FL, Oct.). Supercomputing Computations Research Institute, Florida State University. Proceedings available via anonymous ftp from ftp.scri.fsu.edu in directory pub/parallel-workshop.91.

PEREIRA, L. M. AND NASR, R. 1984. Delta-Prolog: A distributed logic programming language. In Proceedings of the International Conference on Fifth Generation Computer Systems (Tokyo, Nov.), 283–291.

PEYTON-JONES, S. L., CLACK, C., AND HARRIS, N. 1987. GRIP—a parallel graph reduction machine. Tech. Rep., Department of Computer Science, University of London.

PEYTON-JONES, S. L. AND LESTER, D. 1992. Implementing Functional Programming Languages. International Series in Computer Science, Prentice-Hall, Englewood Cliffs, NJ.

QUINN, M. J. AND HATCHER, P. J. 1990. Data-parallel programming in multicomputers. IEEE Softw. (Sept.), 69–76.

RABHI, F. A. AND MANSON, G. A. 1991. Experiments with a transputer-based parallel graph reduction machine. Concurrency Pract. Exper. 3, 4 (August), 413–422.

RADESTOCK, M. AND EISENBACH, S. 1994. What do you get from a π-calculus semantics? In PARLE’94 Parallel Architectures and Languages Europe, C. Halatsis, D. Maritsas, G. Philokyprou, and S. Theodoridis, Eds., LNCS 817, Springer-Verlag, 635–647.

RADHA, S. AND MUTHUKRISHNAN, C. 1992. A portable implementation of UNITY on von Neumann machines. Comput. Lang. 18, 1, 17–30.

RAINA, S. 1990. Software controlled shared virtual memory management on a transputer based multiprocessor. In Transputer Research and Applications 4, D. L. Fielding, Ed., IOS Press, Amsterdam, 143–152.

RANADE, A. G. 1989. Fluent parallel computation. Ph.D. Thesis, Yale University.

SABOT, G. 1989. The Paralation Model: Architecture-Independent Parallel Programming. MIT Press, Cambridge, MA.

SARASWAT, V. A., RINARD, M., AND PANANGADEN, P. 1991. Semantic foundations of concurrent constraint programming. In Proceedings of the POPL ’91 Conference, ACM Press, New York, 333–352.

SEUTTER, F. 1985. CEPROL, a cellular programming language. Parallel Comput. 2, 327–333.

SHAPIRO, E. 1986. Concurrent Prolog: A progress report. IEEE Comput. 19 (August), 44–58.

SHAPIRO, E. 1989. The family of concurrent logic programming languages. ACM Comput. Surv. 21, 3, 412–510.

SHEFFLER, T. J. 1992. Match and move, an approach to data parallel computing. Ph.D. Thesis, Carnegie-Mellon, October. Appears as Rep. CMU-CS-92-203.

SINGH, P. 1993. Graphs as a categorical data type. Master’s Thesis, Computing and Information Science, Queen’s University, Kingston, Canada.

SKEDZIELEWSKI, S. K. 1991. Sisal. In Parallel Functional Languages and Compilers, B. K. Szymanski, Ed., ACM Press Frontier Series, New York, 105–158.

SKILLICORN, D. B. 1990. Architecture-independent parallel computation. IEEE Comput. 23, 12 (Dec.), 38–51.

SKILLICORN, D. B. 1994a. Categorical data types. In Second Workshop on Abstract Models for Parallel Computation, Oxford University Press, New York, 155–168.

SKILLICORN, D. B. 1994b. Foundations of Parallel Programming. Number 6 in Cambridge Series in Parallel Computation, Cambridge University Press, New York.

SKILLICORN, D. B. 1996. Communication skeletons. In Abstract Machine Models for Parallel and Distributed Computing, M. Kara, J. R. Davy, D. Goodeve, and J. Nash, Eds. (Leeds, April), IOS Press, The Netherlands, 163–178.

SKILLICORN, D. B. AND CAI, W. 1995. A cost calculus for parallel functional programming. J. Parallel Distrib. Comput. 28, 1 (July), 65–83.

SKILLICORN, D. B. AND TALIA, D. 1994. Programming Languages for Parallel Processing. IEEE Computer Society Press, Los Alamitos, CA.

SKILLICORN, D. B., HILL, J. M. D., AND MCCOLL, W. F. 1997. Questions and answers about BSP. Sci. Program. 6, 3, 249–274.

SPEZZANO, G. AND TALIA, D. 1997. A high-level cellular programming model for massively parallel processing. In Proceedings of the Second International Workshop on High-Level Programming Models and Supportive Environments, HIPS’97, IEEE Computer Society Press, Los Alamitos, CA, 55–63.

SPIVEY, J. M. 1989. A categorical approach to the theory of lists. In Mathematics of Program Construction (June), LNCS 375, Springer-Verlag, 399–408.

STEELE, G. L., JR. 1992. High Performance Fortran: Status report. In Proceedings of a Workshop on Languages, Compilers and Run-Time Environments for Distributed Memory Multiprocessors (Sept.), appeared as SIGPLAN Not. 28, 1 (Jan.), 1–4.

SUNDERAM, V. 1992. Concurrent computing with PVM. In Proceedings of the Workshop on Cluster Computing (Tallahassee, FL, Dec.). Supercomputing Computations Research Institute, Florida State University. Proceedings available via anonymous ftp from ftp.scri.fsu.edu in directory pub/parallel-workshop.92.

SUNDERAM, V. S. 1990. PVM: A framework for parallel distributed computing. Tech. Rep. ORNL/TM-11375, Department of Math and Computer Science, Emory University, Oak Ridge National Lab, February. Also Concurrency Pract. Exper. 2, 4 (Dec.), 315–349.

SWINEHART, D. ET AL. 1985. The structure of Cedar. SIGPLAN Not. 20, 7 (July), 230–244.

SZAFRON, D., SCHAEFFER, J., WONG, P. S., CHAN, E., LU, P., AND SMITH, C. 1991. Enterprise: An interactive graphical programming environment for distributed software. Available by ftp from cs.ualberta.ca.

TALIA, D. 1993. A survey of PARLOG and Concurrent Prolog: The integration of logic and parallelism. Comput. Lang. 18, 3, 185–196.

TALIA, D. 1994. Parallel logic programming systems on multicomputers. J. Program. Lang. 2, 1 (March), 77–87.

THOMPSON, S. J. 1992. Formulating Haskell. Tech. Rep. No. 29/92, Computing Laboratory, University of Kent, Canterbury, UK.

TSENG, C.-W. 1993. An optimizing Fortran D compiler for MIMD distributed-memory machines. Ph.D. Thesis, Rice University, January. Also Rice COMP TR-93-199.

UEDA, K. 1985. Guarded Horn clauses. Tech. Rep. TR-103, ICOT, Tokyo.

VALIANT, L. G. 1989. Bulk synchronous parallel computers. Tech. Rep. TR-08-89, Computer Science, Harvard University.

VALIANT, L. G. 1990a. A bridging model for parallel computation. Commun. ACM 33, 8 (August), 103–111.

VALIANT, L. G. 1990b. General purpose parallel architectures. In Handbook of Theoretical Computer Science, Vol. A, J. van Leeuwen, Ed., Elsevier Science Publishers and MIT Press.

VAN HENTENRYCK, P. 1989. Parallel constraint satisfaction in logic programming: Preliminary results of Chip within PEPSys. In Proceedings of the Sixth International Conference on Logic Programming, MIT, Cambridge, MA, 165–180.

VIOLARD, E. 1994. A mathematical theory and its environment for parallel programming. Parallel Process. Lett. 4, 3, 313–328.

WATANABE, T. ET AL. 1988. Reflection in an object-oriented concurrent language. SIGPLAN Not. 23, 11 (Nov.), 306–315.

WILKINSON, T., STIEMERLING, T., OSMON, P., SAULSBURY, A., AND KELLY, P. 1992. Angel: A proposed multiprocessor operating system kernel. In Proceedings of the European Workshops on Parallel Computing (EWPC’92) (Barcelona, March 23–24), 316–319.

WILLS, D. S. 1990. Pi: A parallel architecture interface for multi-model execution. Tech. Rep. AI-TR-1245, MIT Artificial Intelligence Laboratory.

WINKLER, T. 1993. Programming in OBJ and Maude. In Functional Programming, Concurrency, Simulation and Automated Reasoning, P. E. Lauer, Ed., LNCS 693, Springer-Verlag, Berlin, 229–277.

YANG, A. AND CHOO, Y. 1991. Parallel-program transformation using a metalanguage. In Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

YANG, A. AND CHOO, Y. 1992a. Formal derivation of an efficient parallel Gauss-Seidel method on a mesh of processors. In Proceedings of the Sixth International Parallel Processing Symposium (March), IEEE Computer Society Press, Los Alamitos, CA.

YANG, A. AND CHOO, Y. 1992b. Metalinguistic features for formal parallel-program transformation. In Proceedings of the Fourth IEEE International Conference on Computer Languages (April), IEEE Computer Society Press, Los Alamitos, CA.

YOKOTE, Y. ET AL. 1987. Concurrent programming in Concurrent Smalltalk. In Object-Oriented Concurrent Programming, A. Yonezawa et al., Eds., MIT Press, Cambridge, MA, 129–158.

YONEZAWA, A. ET AL. 1987. Object-Oriented Concurrent Programming. MIT Press, Cambridge, MA.

ZENITH, S. E. 1991a. The axiomatic characterization of Ease. In Linda-Like Systems and Their Implementation, Edinburgh Parallel Computing Centre, TR91-13, 143–152.

ZENITH, S. E. 1991b. A rationale for programming with Ease. In Research Directions in High-Level Parallel Programming Languages, J. P. Banatre and D. Le Metayer, Eds., LNCS 574, June, Springer-Verlag, 147–156.

ZENITH, S. E. 1991c. Programming with Ease. Centre de Recherche en Informatique, Ecole Nationale Superieure des Mines de Paris, Sept. 20.

ZENITH, S. E. 1992. Ease: The model and its implementation. In Proceedings of a Workshop on Languages, Compilers and Run-Time Environments for Distributed Memory Multiprocessors (Sept.), appeared as SIGPLAN Not. 28, 1 (Jan.), 1993, 87.

Received April 1996; revised April 1997; accepted December 1997
