
CONCURRENCY: PRACTICE AND EXPERIENCE, Concurrency: Pract. Exper., Vol. 11(11), 615–634 (1999)

Accurate performance prediction using visual prototypes

G. R. RIBEIRO JUSTO∗, T. DELAITRE, M. J. ZEMERLY AND S. C. WINTER

Centre for Parallel Computing, Cavendish School of Computer Science, University of Westminster, 115 New Cavendish Street, London W1M 8JS, UK (e-mail: [email protected])

SUMMARY

Behavioural and performance analysis is a fundamental problem in the development of parallel (and distributed) programs. To address this problem, models and supporting environments are required to enable designers to build and analyse their programs. The model we put forward in this paper combines graphical and textual representations of the program structure and uses discrete-event simulation for performance and behaviour predictions. A graphical environment supports our model, providing, amongst other features, a graphical editor, a simulation engine and a performance and behaviour visualisation tool. A number of case studies using this environment are also provided for illustration and validation of our model. Prediction errors observed in comparisons between real execution and simulation of the case studies are within 10%. Copyright 1999 John Wiley & Sons, Ltd.

1. INTRODUCTION

The problem with parallel programs is that, being very complex, their behaviour is difficult to predict through intuition alone. Thus parallel programs have much in common with other engineering artefacts, e.g. space vehicles, buildings and electronic circuits. It should not be surprising (although programmers may resist it) that an engineering approach to parallel program design is not a dispensable option where satisfactory performance is required. Engineers apply a variety of tools to the problem in hand: mathematics, modelling skills, simulation techniques, instrumentation, measurement and visualisation through pictures. When combined with skill and intuition, such tools enable an experienced engineer to postulate the design of an artefact, simulate its behaviour and measure its performance (the latter being a behaviour metric). A tool-based engineering approach is needed in parallel programming.

Commonly, parallel program development methods start with parallelising and porting a sequential code onto the target machine, and then running it to measure and analyse its performance. Reparallelisation is required when the achieved performance is unsatisfactory. This is a time-consuming process and usually entails tuning and debugging before an acceptable performance from the parallel program is obtained. Rapid prototyping is a useful approach to the design of (high-performance) parallel software in that complete algorithms, outline designs or even rough schemes can be evaluated, using performance

∗Correspondence to: G. R. Ribeiro Justo, Centre for Parallel Computing, Cavendish School of Computer Science, University of Westminster, 115 New Cavendish Street, London W1M 8JS, UK. Contract/grant sponsor: EPSRC PSTPA, UK; contract/grant number: GR/K40468. Contract/grant sponsor: EC; contract/grant numbers: CIPA-CT93-0251, CP-93-5383.

CCC 1040–3108/99/110615–20$17.50. Received 24 March 1999. Copyright 1999 John Wiley & Sons, Ltd.


modelling, at a relatively early stage in the program development life-cycle, with respect to possible platform configurations and mapping strategies. Modifying the platform configurations and program-to-platform mappings permits the prototype design to be refined, and this process may continue in an evolutionary fashion before any actual parallel coding.

Performance modelling allows performance estimates of parallel systems to be predicted without actually running the program on the target system itself. There are various reasons why it may be difficult to investigate the behaviour of a system simply by conducting live experiments, including non-availability of the system (e.g. the system may not have been purchased, or may be running in a critical operational context) and experimental impracticalities (e.g. lack of suitable instrumentation). A model enables the evaluation of alternatives in a predictive 'what if' fashion without the need to conduct live experiments. Modelling a system consists of abstracting its salient features of interest to the study. Abstraction is usually necessary in order to obtain a tractable model. However, abstractions of parallel systems may be difficult to specify as they must also be sufficiently detailed to be accurate. The solution of a parallel system model is the system's (predicted) behaviour, and may be obtained by analysis or simulation. Analytical solutions are highly prized, since they are widely applicable, but they can generally only be obtained in the simplest cases. Solutions of practical value must therefore usually be obtained by simulation.

A range of performance statistics can be accumulated about the parallel application program, the operating system software, the hardware, and the interactions between them. Analysis of the raw behaviour and the derived statistics can help the designer to identify and remedy algorithmic and architectural bottlenecks which limit the capacity of the system.

The EDPEPPS (Environment for the Design and Performance Evaluation of Portable Parallel Software)[1] environment described in this paper supports a rapid prototyping philosophy, based on graphical design, simulation and visualisation. EDPEPPS supports the development cycle 'design–simulate–visualise' within the same environment, as the tools are fully integrated. This allows information about the program behaviour generated by the simulation and visualisation tools to be related to the design. The simulation model architecture is modular and extensible, to allow modifications and change of platforms and design as and when required. Also, the EDPEPPS environment allows generation of code for both simulation and real execution to run on the target platform.

In this paper, we first describe a model for representing software architectures (program global structure) aimed at architectural analysis, especially performance analysis. The idea is to extend the usual view of architecture description with some abstract concepts of behaviour and time. These concepts are then interpreted in an execution model which provides behaviour and performance predictions of the software architecture.

The remainder of the paper is organised as follows. In Section 2 we review current parallel system performance environments. In Section 3, we describe our software architecture model, with the main features of the model illustrated with examples. In Section 4, we propose an execution model for performance analysis of software architectures based on discrete-event simulation. Section 5 describes the main aspects of the EDPEPPS environment, and Section 6 describes the experience and results of using our environment with four case studies. Those case studies also illustrate the accuracy of our simulation model in predicting performance. We finally set out our conclusions and directions for future research in Section 7.


2. PARALLEL SYSTEM PERFORMANCE MODELLING ENVIRONMENTS

Several parallel software modelling environments supporting performance engineering activities have been developed recently[2], but few of them follow the prototyping approach of EDPEPPS. In this section, we describe the most significant of these environments and compare them with the EDPEPPS environment.

The GRADE (GRAphical Development Environment)[3,4], part of the SEPP (Software Engineering for Parallel Processing) toolset[5], supports the main activities in the parallel software development cycle in a similar way to EDPEPPS. The key difference from EDPEPPS is its graphical language, GRAPNEL[3,4], which is based on its own message-passing interface implemented on top of PVM[6] and therefore does not model exactly the semantics of C/PVM (unlike EDPEPPS). In addition, GRAPNEL is a 'pure' graphical language in the sense that even the sequential part of each process must be described graphically, a potentially time-consuming process. In EDPEPPS, combined graphical and textual representations are permitted.

The HAMLET environment[7] supports the development of real-time applications for two specific platforms, transputers and PowerPCs, whereas EDPEPPS focuses on the development of portable parallel applications based on PVM and heterogeneous workstation clusters. HAMLET consists of a design entry system (DES), a specification simulator (HASTE), a debugger and monitor (INQUEST) and a trace analysis tool (TATOO). However, the tools are not tightly integrated as in the case of EDPEPPS, but are applied separately. Another limitation of HAMLET, compared with EDPEPPS, is the lack of an animation tool, which is important for behavioural analysis of parallel applications.

N-MAP[8] proposes a performance/behaviour methodology based on the simulation of abstract specifications of program skeletons, from which behaviour and performance predictions are derived. A specification is based on three components: tasks, processes and packets. A task refers to a sequential program segment, and the behaviour (and requirements) of a task is expressed in units of time. Tasks are then ordered in processes, which correspond to 'virtual processors'. The packets denote the data transferred amongst virtual processors. The development process in N-MAP consists of providing a set of specifications for the components above, from which traces of the program simulation are generated. However, the specifications denote program structures which affect the performance, and not necessarily important properties of the program design. Therefore, although N-MAP is claimed to integrate performance and software engineering activities, it is biased towards abstract performance engineering. This means that the prototypes are too abstract to enable the easy derivation of a design and implementation of an application, as is supported in EDPEPPS.

The PEPS project[9] aimed to investigate the modelling, characterisation and monitoring of PVM programs for transputer-based, embedded and workstation-cluster platforms. Performance modelling in PEPS focuses on the performance evaluation of computer architectures. PEPS uses the Simulog simulation toolset (MODLINE), which offers a range of software and hardware components. The software layer in their model is hardware-independent and based on F77. This approach is similar to EDPEPPS, but our language is based on C and our model defines only 43 instructions, compared to 170 in PEPS. In addition, the PEPS model does not take into account the cache effect, whilst EDPEPPS models both data and instruction caches[10].

The ALPSTONE project[11] supports the development of performance-oriented


prototypes based on the BACS description (Basel Algorithm Classification Scheme)[12]. This is in the form of a macroscopic abstraction of program properties, such as process topology and execution structure, data partitioning and distribution descriptions, and interaction specifications. From this description, based on program skeletons, it is possible to generate a time model of the algorithm, which allows performance estimation and prediction of the algorithm run time on a particular system with different data and system sizes. This project differs from EDPEPPS in three ways. Firstly, the language used to describe the program skeletons is not graphical and is too abstract, which means that the derivation of an implementation involves several refinement steps. Secondly, the approach to performance prediction is based on analytical modelling and not on discrete-event simulation models, as in EDPEPPS. Finally, unlike EDPEPPS, the various tools of the environment are not fully integrated.

3. A PERFORMANCE-ORIENTED PARALLEL SOFTWARE ARCHITECTURE MODEL

In spite of the increasing number of Architecture Description Languages (ADLs) currently available, there is little consensus in the research community on what an ADL is and what aspects of an architecture should be modelled by an ADL[13]. A framework for classifying ADLs has been proposed by Medvidovic and Taylor[13]. In their framework, the authors describe components, interfaces, connectors and architectural configurations as the main modelling features for an ADL. We present below our model and how it relates to Medvidovic and Taylor's framework.

3.1. Components

A component usually denotes a unit of computation and is, therefore, the locus of computation and state[14]. In our model, a component roughly corresponds to a unit of computation or a process (as we intend to model parallel and distributed systems). In general, ADLs do not model component semantics beyond their interface (which corresponds to the interaction points between the component and its environment, as described in the next section). In [15], for example, the specification of the component can be optionally defined as a composition of the CSP[16] specifications corresponding to the protocol of each port.

The restricted semantic models of a component provided by many ADLs usually reduce the type of analysis that can be carried out at the architectural level. In [17], we have shown how an extended graphical description can enable us to apply deadlock-free composition and refinement rules. In [18], the author discusses the importance of causal, timing and even probabilistic relationships between actions to enable the specification of the behaviour of parallel and distributed systems.

The description of components in our model is divided into two main parts: the design graphical representation (external view) and the textual representation (internal view). The external view consists of an abstract view of the components of a parallel and distributed application design, that is, the processes, their interfaces (behavioural description) and their interactions, which are represented graphically. The internal view consists of the detailed behaviour of each process and is represented textually.


Figure 1. Graphical and textual partial description of a component

In our model, the behaviour of a component is not formally specified, as we are more concerned with the derivation of an implementation from the architecture, but timing and stochastic expressions can be included in the description of a component in a C-like fashion[1]. Every graphical object corresponds to some textual representation (we currently use C/PVM segments of code[6]). An example is shown in Figure 1, where part of the code of component slave2 is illustrated. Note that the lines of code associated with graphical symbols are underlined and numbered with the same number shown in the graphical representation. The figure exemplifies the send (action number 4) and receive (number 5) actions. The interface actions are currently described in C/PVM[6] and are explained further in the next section.
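As an illustration of the internal (textual) view, the following is a minimal sketch, in C/PVM, of what the text behind a slave-like component might look like. It is not the actual slave2 listing of Figure 1; the message tags, the computation and the use of the parent task as communication partner are assumptions made for the example.

    #include "pvm3.h"

    #define TAG_WORK   1              /* illustrative message tags */
    #define TAG_RESULT 2

    int main(void)
    {
        int parent = pvm_parent();    /* tid of the spawning component */
        int work, result;

        pvm_recv(parent, TAG_WORK);   /* receive action (cf. action 5) */
        pvm_upkint(&work, 1, 1);

        result = work * work;         /* internal computation, textual only */

        pvm_initsend(PvmDataDefault); /* send action (cf. action 4) */
        pvm_pkint(&result, 1, 1);
        pvm_send(parent, TAG_RESULT);

        pvm_exit();
        return 0;
    }

In the environment, each send and receive line would be generated from, and numbered consistently with, its graphical symbol.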

3.2. Interface

A component's interface specifies the protocols a component uses to interact with its environment. Interfaces usually denote ports through which the component can perform several actions, but little information is provided about the causal or timing constraints between the interactions. For example, if we assume a simple architecture described in [19], illustrated


Figure 2. Representing component interfaces (the action symbols include send, receive, signalsend, sendoptimised and mcast)

in Figure 2(a), where component A uses a port 1 to interact with both components B and C via their ports 1, several possible behaviours could be described. Component A could select with which component it interacts, defined by the or relation in Figure 2(b). The interaction between components A and B could enable the action between components A and C, as shown in Figure 2(c). Finally, the interactions between A and components B and C could happen independently, as illustrated in Figure 2(d). So, in our model, we could use the descriptions shown in Figure 2(b), (c) and (d) to model the interface of component A more specifically. A partial order of the actions is defined by the numbers presented within each action's graphical notation.

Actions are classified as either input or output, to denote that they require or provide a service from/to other components, respectively. Graphical inputs are represented by inward-pointing triangle-like symbols, and outputs by outward-pointing triangle-like symbols. Unlike actions in most ADLs, however, each action is uniquely represented, as illustrated in Figure 2.

3.3. Composite component

Composite components, or configurations, are the graphs of connected components that describe the architectural structure. A composite component contains instantiations of components and connectors (or connections) linking components' ports. This information is used to carry out consistency checks (such as valid connections) and to assess architectural (global) properties[20], such as deadlock and performance. The types of assessment will certainly depend upon the information provided and, as previously discussed, many ADLs provide little information to allow architectural analysis.

A composite component is represented in a similar manner to an ordinary component, as a double box. But a composite component can create instances of components. This is done by the operation spawn. As we are dealing with parallel and distributed systems, the location where the instance is to be created can also be represented. Instance creation is viewed as a special action and, as it denotes control and affects behaviour, it is represented in the interface of the composite component.


Figure 3. The simulation model architecture (application layer: API; message-passing layer: PVM, MPI; operating system layer: sockets, scheduler; hardware layer: CPU, Ethernet)

In addition, as composite components are the level at which configurations of interconnected instances are defined, for consistency they are usually closely associated with the configuration they create; this means composite components cannot usually be represented in isolation. An example of a composite component, Master2, and its configuration is presented in Figure 1. A family of instances of a component Slave2 is created.
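As a rough sketch of how such a spawn action maps onto C/PVM, the fragment below creates a family of slave instances; the executable name "slave2" and the family size are illustrative and not taken from Figure 1.

    #include <stdio.h>
    #include "pvm3.h"

    #define NSLAVES 4                 /* illustrative family size */

    int main(void)
    {
        int tids[NSLAVES];

        /* Spawn action of the composite component: with PvmTaskDefault,
           PVM chooses the hosts; a location could instead be named via
           PvmTaskHost, matching the model's optional location attribute. */
        int started = pvm_spawn("slave2", (char **)0, PvmTaskDefault,
                                "", NSLAVES, tids);
        if (started < NSLAVES)
            fprintf(stderr, "only %d of %d slaves started\n",
                    started, NSLAVES);

        pvm_exit();
        return 0;
    }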

4. EXECUTION MODEL

To analyse performance we use symbolic execution, more specifically discrete-event simulation, to run the representation of an architecture and produce predictions about its performance.

The simulation model consists of layers, as illustrated in Figure 3, representing specific aspects of the system such as message passing (currently PVM), the operating system (which models process scheduling and the transport communication layer, either TCP/IP or UDP) and hardware (which models the system's resources, such as hosts, CPUs and the Ethernet)[21]. Modularity and extensibility are key properties of the model, as each layer is further decomposed into modules which can be easily reconfigured.

4.1. Time requirements

In many instances of design, timing requirements are critical and must be explicitly represented, for example, to indicate intervals of time between two actions. In this case, time intervals are represented in absolute terms. When viewing components at the architectural level, details of their computation are usually disregarded. These details, however, are important for performance analysis and should be represented in some way[22]. A solution is to represent a computation by the (absolute) time it takes to be executed (e.g. 0.005 s). A more accurate representation is in terms of timing equations, which specify the time as a function of the possible number of operations involved in the computation. The kind of operation depends on the model used for performance analysis. In our model, timing expressions can be specified using a cputime function, which models the time taken by a sequence (block) of basic instructions (integer and float arithmetic operations, load and store operations, function calls, etc.), including possible cache accesses (cache hit and miss operations)[10]. The time taken by the computation also depends on the parameters set for


the machine model. For example, the call

cputime(2,1,3,1,0,2,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0);

is translated into two integer-store operations with cache hit, one integer-store operation with cache miss, three float-store operations with cache hit, one float-store operation with cache miss, two accesses to an array, one integer addition, one float multiplication, one exponential operation and one absolute-value operation. The different kinds of operations are ordered by position in calls to the function cputime.

The simulation model contains an execution predictor component (described further in Section 5) which checks at simulation time, depending on the type of machine on which the program is running, whether the cache overflows, and accordingly chooses the set of instructions with cache-hit or cache-miss costs.

When prototyping, the computation can be denoted by cputime calls at the end of each block, but conditional branches and loops should still be explicitly defined.
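The sketch below illustrates this convention; the cputime stub merely stands in for the simulator's timing hook (in the toolset the calls are inserted by the PIC tool, described in Section 5), and the operation counts simply reuse the example call above.

    /* Stub standing in for the simulator's cputime() hook; in EDPEPPS
       this call is inserted automatically and interpreted at simulation
       time rather than executed. */
    static void cputime(int n, ...) { (void)n; }

    void block(float *a, int n)
    {
        /* The loop remains explicit; only the straight-line body is
           summarised by a cputime() call at the end of the block. */
        for (int i = 0; i < n; i++) {
            a[i] = a[i] * 2.0f;
            cputime(2,1,3,1,0,2,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0);
        }
    }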

4.2. Probability requirements

Another important aspect of the definition of an action is whether it may or must happen[18]. Most description languages only model the must case. However, in more realistic models the distinction between the two relations should exist and, as suggested by [18], one could even assign probabilities to the occurrences of actions. In our model, we provide the function prob, together with an 'if', to enable an action depending on the probability of its occurrence. Other continuous and discrete distributions can be selected – for example, uniform, exponential and Poisson[21].
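A minimal sketch of a 'may' action guarded by prob is shown below. The stub assumes prob(p) returns nonzero with probability p; the simulator supplies the real primitive, and the destination and tag are illustrative.

    #include <stdlib.h>
    #include "pvm3.h"

    /* Stub assumed to behave like the model's prob() primitive. */
    static int prob(double p) { return rand() < (int)(p * RAND_MAX); }

    void maybe_notify(int dest_tid, int value)
    {
        if (prob(0.3)) {              /* action enabled 30% of the time */
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&value, 1, 1);
            pvm_send(dest_tid, 7);    /* illustrative tag */
        }
    }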

4.3. Portability

To provide portability of the architectural description, the simulation model is modular and parametrised, enabling us to use the same description to analyse the performance of the software architecture assuming different machines and operating systems by only setting new parameters. In the current version, the environment can handle machines such as the UltraSparc 10, SUN4s, SuperSparcs and Pentium I and II[10,21].

5. THE GRAPHICAL ENVIRONMENT

The main tools of the EDPEPPS environment are the graphical design tool (where the graphical and textual representation of the architecture is created), the simulator (which accepts a description of the program and the architecture and generates trace and statistics files) and the visualisation tool (which accepts the trace and statistics files together with the graphical representation to provide statistical graphs and animation of the software architecture). Figure 1 illustrates the EDPEPPS graphical editor and Figure 4 presents a snapshot of the EDPEPPS visualisation tool.

The graphical design tool combines graphical and textual representations, but designers usually use the graphical representation, as the text associated with the graphical elements is created automatically. The designer may, however, enter text (C/PVM code) directly using the text editor.


Figure 4. The EDPEPPS visualisation tool

To maintain consistency between the graphical and textual representations, the text associated with a graphical symbol is protected and can only be modified using the window related to that symbol. This enables the graphical tool to detect errors automatically. To improve reusability, the graphical and textual representations are stored in separate files.

The visualisation tool provides two main types of support – a step-by-step animation of the design, and statistical information about the various layers of the system – from the application level (components and communications) to the operating system (scheduling and communication protocols) and hardware (processor and network) levels. The important aspect is that at any time the designer can have a snapshot of the whole system's performance. Figure 4 illustrates visualisation within EDPEPPS. The main window of the visualisation tool shows the animation of the graphical representation, with the various statistical graphs presented on the right side. Note that during the animation the components and their interconnections are shown dynamically as the execution progresses.

The design and visualisation tools are incorporated within the same GUI, where the designer can switch between these two modes.

Another useful component of our environment is the Reverse Engineering Tool (RET), which allows the translation of existing parallel programs (currently C/PVM is implemented) into a graphical design representation. This is provided in order to allow legacy parallel programs to be analysed with the toolset.

CPU characterisation tools are also an important part of the EDPEPPS environment. These tools allow the simulator to predict the execution time of each block of sequential statements of a component, depending on the machine on which the code is executed. The CPU time characteriser consists of three tools.


Figure 5. The EDPEPPS environment architecture (graphical design tool, reverse engineering tool, simulation tool, visualisation tool, debugging tool, monitoring tool and run-time environment)

The Machine Characteriser (MC) characterises the performance of a particular machine by benchmarking the set of instructions. Currently we have selected 43 instructions, based on the high-level language characterisation work of Saavedra[23] but extended to include instruction and data cache models[10].

The Program Instruction Characteriser (PIC) performs a static analysis by parsing the source code of the program. The analysis involves counting the number of instructions in each block and providing some parameters needed by the simulator for the cache models, which are evaluated at run time as the parser does not know onto which machine the process is mapped.

The Execution Predictor (EP) decides whether the instruction and data caches overflow during the execution of each block, depending on the sizes of the instruction and data caches of the machine in question.
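In essence, the per-block decision made by the EP can be pictured as in the following sketch; the structure and field names are illustrative, as the real tool works on the parameters emitted by PIC and the MC benchmarks.

    /* Illustrative types; the real EP consumes PIC output and MC data. */
    struct block   { long code_bytes, data_bytes; };
    struct machine { long icache_bytes, dcache_bytes; };

    /* Charge cache-hit costs only if the block's footprint fits in the
       corresponding cache of the target machine. */
    static int icache_hit(const struct block *b, const struct machine *m)
    { return b->code_bytes <= m->icache_bytes; }

    static int dcache_hit(const struct block *b, const struct machine *m)
    { return b->data_bytes <= m->dcache_bytes; }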

Finally, a distributed debugging facility, DDBG[24], has been integrated within the environment. Using DDBG, the user is able to set breakpoints at the graphical and textual levels for debugging purposes. Currently, DDBG supports C/PVM.

5.1. The development process

The process starts with the graphical design tool (GDT) by building a graph representing a parallel program design. This graph could be obtained from the reverse engineering tool (RET) if the parallel application already exists. The graph is composed of computational tasks and communications. The tool provides graphical representations for communication calls, such as send and receive, which the user can select to build the required design. The software designer can then generate (by the click of a button) real code (the current implementation is based on C and PVM) for both simulation and real execution. Figure 5 illustrates the flow of actions when using the environment.

In the simulation path, the source code obtained from the graphical design tool is analysed by PIC to characterise the code and insert cputime calls at the end of each


computational block. The instrumented source files are translated, using a tool based on the Sage++ toolkit[25], into a queueing network representation in SES/Workbench format. SES/Workbench translates the graph file into the Workbench object-oriented simulation language, SES/sim[26], using an SES utility (sestran). The sim file is then used to generate an executable model using some SES/Workbench utilities, libraries, declarations and the communication platform model (currently, PVM is implemented). The simulation is based on discrete-event modelling. SES/Workbench has been used both to develop and to simulate platform models; thus the SES/Workbench simulation engine is an intrinsic part of the toolset. All these actions are hidden from the user and are executed from the graphical design window by a click on the simulation button. The simulation executable is run using three input files containing parameters concerning the target virtual environment (e.g. number of hosts, host names, architecture, the UDP communication characteristics and the timing costs for the set of instructions used by the EP tool[21]). The UDP model and the instruction costs are obtained by benchmarking the host machines in the network (benchmarks are provided off-line).

The simulation outputs are the execution time, a trace file and a statistics file. These files are then used by the visualisation tool, in conjunction with the currently loaded application, to animate the design and visualise the performance of the system. The design can be modified and the same cycle repeated until a satisfactory performance is achieved.

In the real execution path, the C source files are generated automatically and then compiled and executed to produce the trace file required for the visualisation/animation process. This step can be used for validation of simulation results, but only when the target machine is accessible. The visualisation tool offers the designer graphical views (animation) representing the execution of the designed parallel application, as well as the visualisation of its performance.

6. CASE STUDIES

Four case studies are presented here to show various aspects of using the EDPEPPS environment. The first, the COMMS1 benchmark, is used to test the accuracy of the EDPEPPS communication model on its own. The second case study is the Bessel equation, a purely sequential C code used to test the accuracy of the EDPEPPS CPU characterisation toolset (Chronos). The third case study uses a prototype of the Pipeline Processor Farm (PPF) model[27] with an image decoding application, where the time delays for the computation parts were extracted from [27]. This case study shows the usefulness of EDPEPPS for sizing an application at an early stage of development. The fourth case study is a communication-intensive application based on two parallel Givens linear solver methods developed in [28]. Here the EDPEPPS environment should help to choose the best algorithm and predict the results for a larger network.

6.1. COMMS1 benchmark

The COMMS1 benchmark is taken from the Parkbench[29] suite (version 3.0). COMMS1 is designed to measure the communication performance of a parallel system by exchanging messages of various sizes between two processors. COMMS1 is selected here to highlight the accuracy of the communication model used.
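The master side of such a measurement might look like the following sketch of a C/PVM ping-pong kernel; the tag, buffer handling and timer are assumptions for illustration, and the benchmark itself averages over many iterations, as noted below.

    #include <sys/time.h>
    #include "pvm3.h"

    /* Round-trip time of an m-byte message to an echo task (master side). */
    double pingpong(int echo_tid, char *buf, int m)
    {
        struct timeval t0, t1;

        gettimeofday(&t0, 0);
        pvm_initsend(PvmDataDefault);
        pvm_pkbyte(buf, m, 1);
        pvm_send(echo_tid, 99);       /* illustrative tag */
        pvm_recv(echo_tid, 99);       /* echo task returns the message */
        pvm_upkbyte(buf, m, 1);
        gettimeofday(&t1, 0);

        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    }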


Figure 6. Comparison between predictions and measurements for COMMS1 (communication time in seconds against PVM message size, for real execution and simulation)

Figure 6 compares the predicted and measured execution times for the COMMS1 benchmark (averages of 1000 iterations) on two of our machines (a Sparc20 and a Pentium). The figure shows a good match between the two curves. The step-like features are caused by fragmentation of the message into 1500-byte segments at the IP level. Note that only message sizes of up to 4 Kbytes are used, as PVM fragments messages of a larger size.

6.2. Bessel equation

The computationally intensive application chosen here to validate the CPU model is the Bessel equation, a differential equation defined by

$$x^2 \frac{d^2y}{dx^2} + x \frac{dy}{dx} + (x^2 - m^2)y = 0$$

When m is an integer, solutions to the Bessel differential equation are given by the Bessel function of the second kind[10], also called the Neumann function or Weber function.

The Bessel function of the second kind is translated into C and executed 100,000 times. The results of the simulation and real execution obtained for four of our machines (a Sparc20, a Sparc5, a Pentium and a 486) are given in Figure 7.
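A sketch of the kind of sequential kernel involved is shown below, using the POSIX yn() Bessel function of the second kind; the order, arguments and checksum are illustrative, not the paper's exact code (link with -lm).

    #include <math.h>   /* yn(): Bessel function of the second kind */
    #include <stdio.h>

    int main(void)
    {
        double acc = 0.0;

        /* Evaluate the function 100,000 times, as in the case study. */
        for (long i = 0; i < 100000; i++)
            acc += yn(2, 1.0 + (double)(i % 100) * 0.1);

        printf("checksum: %g\n", acc); /* keeps the loop from being elided */
        return 0;
    }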

The average error obtained over all four machines was about 6.13%. The errors for the 486 machine are larger than those for the other machines, probably because the 486 has a single cache for both data and instructions. However, for the superscalar machines, the Pentium and the SuperSparc, the results are encouraging, with errors of only 1.7% and 3.4%, respectively.

6.3. A case study: PPF with CCITT H.261 decoder

The application chosen here is the pipeline processor farm (PPF) model of a standard image processing algorithm, the H.261 decoder[30], proposed by Downton et al.[27].


Figure 7. Comparison between real execution and simulation for the Bessel equation (real and simulated execution times in seconds, with percentage differences, for a Pentium 150 MHz (FreeBSD 2.1), a SparcStation 20 SuperSparc 60 MHz (Solaris 2.4), a SparcStation 5 MicroSparc II 70 MHz (Solaris 2.3) and a 486DX4 100 MHz (FreeBSD 2.0.5))

The H.261 algorithm decomposes into a three-stage parallel pipeline: frame initialisation (T1), frame decoder loop (T2) and frame output (T3). The first and last stages are inherently sequential, whereas the middle stage contains considerable data parallelism. Thus, for example, the PPF topology for a middle-stage farm of five tasks is shown in Figure 8(a). The number of possible topologies which solve a given problem is clearly very large, even for the H.261 algorithm. The PPF model thus implicitly provides a rich set of experiments for validation of the simulator. The same topological variation in the PPF model leads directly to performance variation in the algorithm, which, typically, is only poorly understood at the outset of design. One of the main purposes of the simulation tool in this case is to enable a designer to identify the optimal topology, quickly and easily, without resorting to run-time experimentation.

Two experiments, for one and five images (frames), were carried out. The number of processors in stage T2 is varied from 1 to 5 (T1 and T3 were mapped onto the same processor). In every case, the load is evenly balanced between processors.

The target platform for this case study is a heterogeneous network of up to six workstations (SUN4s, SuperSparcs and PCs). Timings for the three algorithm stages were extracted from [27] and inserted as time delays. Figure 8(b) shows the simulated and real experimental results for speed-up. Observe that the sequential time used to calculate the speed-up was taken as that of the fastest machine.

As expected, the figure shows that the 5-frame scenario performs better than the 1-frame scenario, since the pipeline is fuller in the former case. The difference between simulated and real speed-ups is below 10%, even though the PPF simulation results do not include packing costs.


Figure 8. (a) PPF topology for a three-stage pipeline; (b) comparison between predictions and real experiments for PPF (speed-up against number of processors in T2, for the EDPEPPS simulator and real experiments with 1 and 5 frames)

6.4. The Givens linear solver

The case study selected here falls into the communication- and computation-intensive class of algorithms and can be represented in the following form:

$$Ax = b \qquad (1)$$

where $A$ is a non-singular square matrix, $b$ is the right-hand-side vector and $x$ is a vector of unknowns. The Givens rotation method selected here is particularly interesting since it does not require pivoting – a difficult problem to parallelise – is numerically stable and is inherently more accurate than other methods[28]. The Givens transformation is defined by


a 2 × 2 rotation matrix

$$G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \qquad (2)$$

where $c^2 + s^2 = 1$. A Givens rotation is used to eliminate the elements of a vector or a matrix as follows:

$$\begin{pmatrix} c & s \\ -s & c \end{pmatrix} \times \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix} \qquad (3)$$

where $c = a/\sqrt{a^2 + b^2}$ and $s = b/\sqrt{a^2 + b^2}$.

The Givens algorithm for solving a linear system with $N$ equations can be decomposed

into two computational stages. The first is the triangulation of the initial matrix; this stage is represented by the execution of the elimination block $N$ times. The second stage is the substitution block, which solves the triangular matrix. The triangulation stage, which is the most time-consuming part (with complexity $O(N^3)$, as opposed to $O(N^2)$ for the back-substitution stage), is parallelised with two different techniques: collective and pipeline. The back-substitution stage is also programmed differently in the two algorithms, as will be discussed later.
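For concreteness, one elimination step of the triangulation stage can be sketched as below, directly from equations (2) and (3); the row layout and naming are illustrative rather than the paper's code.

    #include <math.h>

    /* Rotate rows ri and rj (each of length n, including the right-hand
       side) so that rj[col] is annihilated, as in equations (2) and (3). */
    static void givens_eliminate(double *ri, double *rj, int n, int col)
    {
        double r = sqrt(ri[col] * ri[col] + rj[col] * rj[col]);
        if (r == 0.0)
            return;                   /* element already zero */

        double c = ri[col] / r, s = rj[col] / r;
        for (int k = col; k < n; k++) {
            double t =  c * ri[k] + s * rj[k];  /* apply the rotation G */
            rj[k]    = -s * ri[k] + c * rj[k];
            ri[k]    =  t;
        }
    }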

Initially, block-row data decomposition, which divides the matrix horizontally and assigns adjacent blocks of rows to neighbouring processors, is used in both methods. The first step (A) in the triangulation stage is the same for both methods: all processors eliminate the required columns for their rows (except the first row of each processor, which will be eliminated in the next step). Then, in the collective method, the columns of the first rows are eliminated (step B) using collective communication (rows are collected by the processor holding the corresponding row of the currently eliminated column, called the sender). The back-substitution stage is started by the last processor and its results are broadcast to the other processors; then, in a similar way, all the processors solve their triangulated matrices in a back-pipeline fashion.

In the pipeline method, instead of using collective communications to eliminate the columns of the first rows (step B in method 1), the row in question is passed from the sender to its neighbour to eliminate its column, and then passed in a pipeline fashion through the other neighbours until all the columns are eliminated. The last processor keeps the rows in a full matrix to be used later, on its own, in the back-substitution stage.

The two methods were designed using the EDPEPPS environment, and results for both simulation and real execution (on a real network) were obtained. The network used for this algorithm consists of three 300 MHz Ultra-Sparc 10 machines, one 233 MHz Pentium II, one 150 MHz Pentium, one 75 MHz Super-Sparc 20 and one 60 MHz Super-Sparc 5. The machines were mapped differently in the two algorithms, based on a few simulation tests, and only the final optimal mapping is used for the results shown here. The machines were mapped in increasing power order for the pipeline method (but giving priority to the fastest machines if the number of processors is less than the total of seven) and in decreasing power order for the collective method. The tests were done for problem sizes of 256 and 512 equations, but for the 256 size we did not get any significant speedup relative to the fastest machine in the network, as the algorithm is communication-intensive, which affects the performance for small problem sizes.

Figure 9 shows the results of the simulation and real execution measurements (averages of 10 runs taken at night) for the sequential Givens solver on the various machines in the network, for a problem size of 512 × 512 (similar results, not shown here, were obtained for 256 × 256).


Figure 9. Comparison between predictions and measurements for sequential Givens for matrix size 512 × 512 (real and simulated execution times in seconds, with percentage differences, for an UltraSparc 10 300 MHz (Solaris 2.6), a Pentium II 233 MHz (FreeBSD 2.2.5), a Pentium I 150 MHz (FreeBSD 2.2.6), a SparcStation 20 SuperSparc 60 MHz (Solaris 2.5) and a SparcStation 5 MicroSparc II 70 MHz (Solaris 2.5))

Note that the operating systems here differ from those of the Bessel equation case study presented above, and hence the machines needed to be benchmarked again to account for this.

The results show that the average error is well below 10% for all the machines. The figure also shows that the maximum error was about 12%, for the Pentium I, which here runs a different version of the operating system from that used in the Bessel equation case study. Note that the errors for the UltraSparc (1.5%) and the Pentium II (5%) are much smaller than those for the other, slower machines. Normally, further testing and benchmarking would be needed on the machines with larger errors, to ensure that the parameters obtained for the CPU instructions were not affected by unexpectedly high load, which could cause fluctuations in the results. However, for this case study we are satisfied with the current error rate and will use the obtained parameters for the parallel algorithms.

Figure 10 shows the results of the simulation and real execution measurements (averages of 10 runs taken at night) for both the collective and the pipeline methods.

Figure 10 shows clearly that the measurements and predictions for both methods agree well, with a maximum error well below 10% (except for one case, the collective algorithm with seven processors, at 13%). These errors include the effect of load on the machines, detected as larger times for some runs than for others; although averaged out with the other measurements, this load still had some effect. However, all the measurements, including the benchmarks, were performed at night under low load, and the standard deviation of the measurements did not exceed 5% on any of the runs. As expected, the figure also shows the superiority of the pipeline algorithm over the collective algorithm, even for small numbers of processors. Note that the machines are heterogeneous, and adding more processors sometimes increases the execution time rather than reducing it.


Figure 10. Comparison between predictions and measurements for parallel Givens, collective and pipeline methods (execution time in seconds against number of processors, for real and simulated runs with problem sizes 256 and 512; the results displayed for one processor are for the UltraSparc 10, as it is one of the fastest machines)

For the 512 problem size we obtained speedups of 2 and 2.25 for four and five processors, respectively, with the pipeline algorithm. Increasing the number of processors above five (four for the collective method) increases the execution time in both cases, as the processors other than the UltraSparcs and the 233 MHz Pentium II are considerably slower than the fastest four. However, in heterogeneous networks of workstations the CPU utilisation is another important factor to consider when analysing the figures for speedup or execution time. This is because a network is a multi-tasking environment, and reducing the load on one fast machine, even at the expense of a larger execution time, gives other tasks queueing on that machine a chance to execute faster. Knowing that the pipeline algorithm is superior to the collective one, the user can then experiment with adding more powerful processors than are available, to see whether better speedups can be obtained (for example, by substituting UltraSparcs or Pentium IIs for the Pentium I and the less powerful machines). However, this experiment could not be validated and was not performed here.

As for the usability of the environment as a whole, all the above-mentioned case studies were performed in tutorials by students on the MSc in Parallel and Distributed Computing at the University of Westminster, with little or no knowledge of PVM and no previous experience of EDPEPPS, but with some guidance from the tutor. The simulator is transparent to the user, the only interaction being through PVMGraph (the graphical design tool) and PVMVis (the visualisation tool). The environment currently supports C and PVM, and the user needs to be familiar with the syntax of these languages. A demo of how to build the PPF case study (shown above) in PVMGraph can be seen online in [31].


7. CONCLUSIONS AND FURTHER RESEARCH

The paper addresses the fundamental limitations of current visual notations for parallel and distributed software architectures, which do not provide enough information to enable behaviour and performance analysis. A new model based on an extended graphical and textual description is put forward, the main features of which are the descriptions of actions and of the temporal and stochastic relations between them, instead of the usual simple interfaces denoting only inputs and outputs. We also claim that representing each action with a unique, numbered symbol helps designers to gain considerable insight into the behaviour of components at the architectural level without having to look at their internal details.

Our current environment, provided by EDPEPPS, is based on a performance-oriented parallel program design method. The environment supports graphical design, performance prediction through modelling and simulation, and visualisation of predicted program behaviour. The designer is not required to leave the graphical design environment to view the program's behaviour, since the visualisation is an animation of the graphical program description, and the transition between design and visualisation viewpoints is virtually seamless. It is intended that this environment will encourage a philosophy of program design, based on a rapid synthesis–evaluation design cycle, in the emerging breed of parallel programmers.

Success of the environment depends critically on the accuracy of the underlying simulation system. Preliminary validation experiments for the PVM-based platform model have been very encouraging, demonstrating errors between the simulation and the real execution of around 10% for the examples presented in this paper.

Regarding future research, we are now looking at the problem of how to represent patterns and high-level components. At the moment, components and configurations can be easily reused by referring to their names – that is, when a component (simple or composite) is created, the tool checks whether a component with that name already exists and asks if the designer wishes to reuse it. There is not, however, a notion of the constraints and templates which are necessary for instantiating a style or pattern. We are particularly interested in describing styles for performance-critical architectures, where components must satisfy not only behavioural constraints but also performance constraints.

Another important direction of our work is to generalise the simulation model and extend it to support other platforms, such as MPI.

ACKNOWLEDGEMENTS

The authors wish to acknowledge F. Spies, J. Bourgeois, R. Bigeard and F. Schinkmann for their contribution to the development of the EDPEPPS environment. Also, T. Delaitre wishes to acknowledge his PhD supervisor, S. Poslad, for advice on the simulation aspects of this work.

This project has been partially funded by the EPSRC PSTPA programme, under grant GR/K40468, and EC contracts CIPA-CT93-0251 and CP-93-5383.

REFERENCES

1. T. Delaitre, G. R. R. Justo, F. Spies and S. C. Winter, 'A graphical toolset for simulation modelling of parallel systems', Parallel Comput., 22, 1823–1836 (1997).

2. C. Pancake, M. Simmons and J. Yan, 'Performance evaluation tools for parallel and distributed systems', Computer, 28, 16–19 (1995).

3. P. Kacsuk, G. Dozsa and T. Fadgyas, 'Designing parallel programs by the graphical language GRAPNEL', Microprocess. Microprogram., 41, 625–643 (1996).

4. P. Kacsuk, J. C. Cunha, G. Dozsa, J. Lourenco, T. Fadgyas and T. Antao, 'A graphical development and debugging environment for parallel programs', Parallel Comput., 22, 1747–1770 (1997).

5. SEPP Web Site. http://www.cpc.wmin.ac.uk/~sepp.

6. V. S. Sunderam, 'PVM: a framework for parallel distributed computing', Concurrency: Pract. Exp., 2(4), 315–339 (1990).

7. P. Pouzet, J. Paris and V. Jorrand, 'Parallel application design: the simulation approach with HASTE', in W. Gentzsch and U. Harms, eds., HPCN2, 1994, pp. 379–393.

8. A. Ferscha and J. Johnson, 'Performance prototyping of parallel applications in N-MAP', in 2nd Int. Conf. on Algorithms & Architectures for Parallel Processing, IEEE CS Press, June 1996, pp. 84–91.

9. PEPS Partners, 'PEPS Bulletin, the Bulletin of the Performance Evaluation of Parallel Systems Project', EEC PEPS Esprit 6942, 1993.

10. J. Bourgeois, 'CPU modelling in EDPEPPS', EDPEPPS EPSRC Project (GR/K40468) D3.1.6 (EDPEPPS/35), Centre for Parallel Computing, University of Westminster, London, June 1997.

11. W. Kuhn and H. Burkhart, 'The ALPSTONE Project: an overview of a performance modelling environment', in 2nd Int. Conf. on HiPC'96, McGraw Hill, 1996, pp. 491–496.

12. H. Burkhart et al., 'BACS: Basel algorithm classification scheme, version 1.1', Technical Report 93-3, Universität Basel, URZ+IFI, 1993.

13. N. Medvidovic and R. Taylor, 'A framework for classifying and comparing architecture description languages', in 6th European Soft. Eng. Conf., ACM Press, Sep. 1997, pp. 60–76.

14. M. Shaw et al., 'Abstractions for software architecture and tools to support them', IEEE Trans. Softw. Eng., 21(4), 314–335 (1995).

15. R. Allen, 'A formal approach to software architecture', PhD thesis, School of Computer Science, Carnegie Mellon University, May 1997.

16. C. A. R. Hoare, Communicating Sequential Processes, Prentice-Hall, 1985.

17. G. R. R. Justo and P. R. F. Cunha, 'Framework for developing extensible and reusable parallel and distributed applications', in IEEE Int. Conf. on Algorithms & Architectures for Parallel Processing, IEEE CS Press, 1996, pp. 29–36.

18. L. F. Pires, 'Architectural notes: a framework for distributed systems development', PhD thesis, Centre for Telematics and Inf. Tech., The Netherlands, Sep. 1994.

19. J. Magee, N. Dulay and J. Kramer, 'A constructive development environment for parallel and distributed programs', in 2nd Int. Workshop on Configurable Distributed Systems, SEI, Carnegie Mellon University, IEEE Computer Society Press, March 1994.

20. M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall, 1996.

21. T. Delaitre et al., 'Final model definition', EDPEPPS EPSRC Project (GR/K40468) D3.1.4 (EDPEPPS/23), Centre for Parallel Computing, University of Westminster, London, March 1997.

22. J. K. Hollingsworth and M. Steele, 'Grindstone: a test suite for parallel performance tools', Technical Report UMIACS-TR-96-73, Institute of Advanced Computer Studies and Computer Science Department, University of Maryland, College Park, 1996.

23. R. H. Saavedra-Barrera and A. J. Smith, 'Performance characterisation of optimising compilers', IEEE Trans. Softw. Eng., 21(7), (1995).

24. J. C. Cunha, J. Lourenco and T. Antao, 'A debugging engine for parallel and distributed environments', in Proc. DAPSYS'96, 1st Austrian–Hungarian Workshop on Distributed and Parallel Systems, Miskolc, Hungary, 1996, pp. 111–118.

25. F. Bodin et al., 'Sage++: an object-oriented toolkit and class library for building Fortran and C++ restructuring tools', Proc. 2nd Annual Object-Oriented Numerics Conf., 1994.

26. K. Sheehan and M. Esslinger, 'The SES/sim modeling language', Proc. Society for Computer Simulation, San Diego, CA, July 1989, pp. 25–32.

27. A. C. Downton, R. W. S. Tregidgo and A. Cuhadar, 'Top-down structured parallelisation of embedded image processing applications', IEE Proc.-Vis. Image Signal Process., 141(6), 431–437 (1994).

28. J. Papay, M. J. Zemerly and G. R. Nudd, 'Pipelining the Givens linear solver on distributed memory machines', Supercomput. J., XII-3(65), 37–42.

29. R. Hockney and M. Berry, 'Public international benchmarks for parallel computers, Report-1', Tech. Rep., Parkbench Committee, 1994.

30. CCITT, 'Draft revisions of Recommendation H.261: video codec for audiovisual services at p × 64 kbit/s', Signal Process. Image Commun., 2(2), 221–239 (1990).

31. S. Randoux, http://www.cpc.wmin.ac.uk/~edpepps/demohtml/ppf.html

32. A. Beguelin, J. Dongarra, A. Geist and R. Manchek, 'HeNCE: a heterogeneous network computing environment', Sci. Program., 3(1), 49–60 (1994).

33. T. Bemmerl, 'The TOPSYS architecture', in H. Burkhart, ed., CONPAR 90–VAPP IV Conference, Zurich, Switzerland, LNCS 457, Springer, Sep. 1990, pp. 732–743.

34. R. Suppi et al., 'Simulation in parallel software design', Proc. Euro-PDS'97, Barcelona, 1997, pp. 51–60.

35. N. Fang and H. Burkhart, 'PEMPI: from MPI to standard programming environment', in Scalable Parallel Libraries Conference II SPLC'94, Mississippi, 1994, pp. 31–38.

36. I. Foster, Designing and Building Parallel Programs, Addison-Wesley, 1995.

37. T. Ludwig et al., 'The TOOL-SET – an integrated tool environment for PVM', in EuroPVM'95, Lyon, France, Sep. 1995. Tech. Rep. 95-02, Ecole Normale Superieure de Lyon.

38. P. Newton and J. Dongarra, 'Overview of VPE: a visual environment for message-passing', Heterogeneous Computing Workshop, 1995.

39. J. T. Stasko, 'The PARADE environment for visualizing parallel program executions', Technical Report GIT-GVU-95-03, Graphics, Visualization and Usability Center, Georgia Inst. of Tech., 1994.