TRANSCRIPT
University of Surrey
Department of Computing
Module CSM23 - Grid Computing
Lectures in Parallel Computing
Dr. Roger M.A. Peel
R.Peel@surrey.ac.uk
http://www.computing.surrey.ac.uk/personal/st/R.Peel/csm23/index.html
- parallel - 1 - © RMAP 2004
Introduction
In my two lectures, I hope to introduce some of the important issues of any parallel computing activity - be it high-performance computing, multiprocessor embedded systems or Grid computing:
• Why use Parallel Computing?
• Definitions and Limitations of Parallelism
• Major Classifications of Parallelism
• Communication Schemes
• Deadlock and Livelock
• Applications and Examples
Why use Parallel Computing?
There are several reasons why we might choose to use parallel processing:
• Performance;
• Economy;
• Ease of programming.
This parallelism might come from:
• A pipelined or superscalar processor;
• A vector processor;
• A single-box multiprocessor;
• Multiple processors connected with communication lines [distributed processing].
Definitions and Limitations of Parallelism
The performance of a distributed parallel processor is influenced by:
• The number and throughput of the processing nodes.
• The bandwidth between the processing nodes. This is measured in Mbytes/sec (or Gbytes/sec) and determines the rate at which data can be sent from one node to another. It may vary for different pairs of nodes in a system.
• The latency in the node-to-node connections measures their delays in transmission - influenced by the bit rate, the time taken to route data, and the transmission protocol used. This is measured in seconds.
• The Flow Control strategy used influences performance - nodes must be prevented from flooding each other, or the routing fabric, with traffic.
• The Speedup factor for a parallel computer can be defined as

    S(n) = (execution time using one processor) / (execution time using a multiprocessor with n processors)
Definitions and Limitations of Parallelism - Amdahl’s Law
Any parallel code will also have sequential elements - at startup / shutdown, at the beginning and end of each loop, and so on. Consider how much processing is done sequentially and in parallel.
Gene Amdahl suggested the following law, relating to vector processors, but it is equally appropriate to VLIW machines, MIMD multiprocessors, networks of Grid computers and so on.
If the fraction of code in an application that cannot be parallelised is f, and the time taken for the whole computation on one processor is t, the time taken to perform the computation with n processors is given by

    ft + (1 − f)t/n

and the speedup factor is

    S(n) = t / (ft + (1 − f)t/n) = n / (1 + (n − 1)f)

This ignores the overhead that is usually added by parallelisation, as well as any communication costs between the processors - often a significant extra cost.
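To make the formula concrete, here is a small Python sketch (Python rather than the occam used later in these slides; the function name is my own) that evaluates S(n) and the limiting speedup 1/f for a few serial fractions:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: S(n) = n / (1 + (n - 1) * f),
    where f is the fraction of the code that cannot be parallelised."""
    return n / (1 + (n - 1) * f)

for f in (0.05, 0.10, 0.20):
    # Even with 16 processors the speedup is already well below n,
    # and no number of processors can beat the 1/f ceiling.
    print(f"f = {f:.2f}: S(16) = {amdahl_speedup(f, 16):.2f}, "
          f"limit 1/f = {1 / f:.1f}")
```

With f = 0.05, sixteen processors give a speedup of only about 9, and no machine, however large, can exceed 20.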
Definitions and Limitations of Parallelism - Amdahl’s Law
[Figure: Speedup vs No. of Processors - S(n) plotted for 0 to 20 processors, with one curve for each of f = 0%, 5%, 10% and 20%]
Lessons from Amdahl’s Law
From the above, we can see:
• Even for a huge number of processors (or pipelined vector elements), the maximum speedup is 1/f.
• Small reductions in the sequential overhead can make a huge difference in throughput.
• Reducing the serial & communications overhead (e.g. by running communications and computation in parallel) is very beneficial.
Major Classifications of Parallelism
There are only three major classes of parallel processing:
Algorithmic
we split the algorithm into sections (e.g. pipelining)
Geometric
we split the static data space into sections (e.g. process an image on an array of processors)
Processor Farming
we pass the input data to many processors (e.g. pass ray-tracing coordinates to several processors, one ray at a time)
Some parallel applications might combine these techniques.
But first - Load Balancing
There are three forms of load balancing:
• Static load balancing
The choice of which processor to use for each part of the task is made at compile time. This is inflexible, but simple.
• Semi-dynamic load balancing
The choice of processor is made at run-time, but once started, each task must run to completion on the processor chosen. This is more efficient, but not always possible.
• Fully-dynamic load balancing
Here, tasks may be interrupted and moved between processors at will. This will enable processors with different capabilities always to be used to best advantage (e.g. FPU). The context-switching and communication cost of interrupting and moving processes (and their data) may outweigh the gains of optimal scheduling.
Algorithmic Parallelism
Many tasks can be split so that a stream of data may be processed in successive stages on a series of processors.
As the first stage finishes its processing, its result is passed to the second stage. The first stage may now accept more input data and process it. When the second stage finishes, it passes its result on, accepts the result from the first stage, and so on. Thus the first task to be submitted passes out of the pipeline in n cycles, where n is the number of processing stages in the pipeline.
When the pipeline is full, one result is produced at every cycle.
At the end of continuous operation, the early stages in the pipeline go idle before the last result is flushed out.
Load balancing is STATIC - the speed of the pipeline is determined by the speed of the slowest stage.
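The behaviour described above can be sketched in Python (a stand-in for the occam of these slides), with one thread per stage and a queue as the channel between neighbours; the three stage functions are arbitrary placeholders:

```python
import queue
import threading

def stage(fn, inq, outq):
    """One pipeline stage: read an item, process it, pass the result on.
    A None marker flushes the pipeline and stops the stage."""
    while True:
        item = inq.get()
        if item is None:
            outq.put(None)          # propagate the flush marker
            break
        outq.put(fn(item))

# Three-stage linear pipeline: q0 -> stage -> q1 -> stage -> q2 -> stage -> q3
q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
workers = [threading.Thread(target=stage, args=(f, i, o))
           for f, i, o in ((lambda x: x + 1, q0, q1),
                           (lambda x: x * 2, q1, q2),
                           (lambda x: x - 3, q2, q3))]
for w in workers:
    w.start()

for x in range(5):                  # stream data into the first stage
    q0.put(x)
q0.put(None)                        # end of input: flush the pipeline

out = []
while (r := q3.get()) is not None:
    out.append(r)
print(out)                          # → [-1, 1, 3, 5, 7]
```

Once the pipeline is full, all three stages work concurrently; the slowest stage sets the overall rate, exactly as the static load-balancing remark above says.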
Algorithmic Parallelism
[Figure: three pipeline topologies, each with data flowing in and results flowing out - a linear pipeline (or chain), a pipeline with a parallel section, and an irregular network (the general case)]
Geometric Parallelism
Some regular-patterned tasks are suitable for processing by spreading their data across several processors and performing the same task on each section in parallel.
Many examples involve image processing - spread the pixels of an image across an array of transputers and perform local averaging, contrast enhancement etc. on each portion of the image.
Many such tasks involve communication of boundary data from one portion of the image to another (e.g. Conway's Game of Life). It may be possible to duplicate this edge data.
Load balancing is STATIC - the initial partitioning of the data determines the execution time to process each area. Rectangular blocks may not be the best choice - consider stripes, concentric squares, etc.
Initial loading of the data may prove to be a serious overhead.
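A minimal Python sketch of the boundary-duplication idea (the 1-D "image", stripe sizes and smoothing function are all invented for illustration): each stripe carries a copy of one neighbouring element on each side, so a local-averaging step needs no communication.

```python
grid = list(range(12))                  # toy 1-D "image" of 12 pixels
nproc = 3
chunk = len(grid) // nproc

def smooth(seg):
    """Average each interior element with its two neighbours."""
    return [(seg[i - 1] + seg[i] + seg[i + 1]) / 3
            for i in range(1, len(seg) - 1)]

padded = [grid[0]] + grid + [grid[-1]]  # replicate the global edges
# Each stripe = its own chunk plus one duplicated edge element per side.
stripes = [padded[p * chunk : p * chunk + chunk + 2] for p in range(nproc)]

parallel = [v for s in stripes for v in smooth(s)]  # each stripe independently
assert parallel == smooth(padded)       # identical to the one-processor answer
print(len(parallel))                    # 12 smoothed pixels
```

The assertion confirms that processing the stripes independently, each with its duplicated edge data, reproduces the single-processor result.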
Geometric Parallelism
(Data is distributed amongst the available processors)
Data Array Connected processors
Processor Farming
The third major form of parallelism involves sharing work out from a central controller process to several worker processes. The latter just accept packets of command data and return results. The controller splits up the task, sending work packets to free processors (i.e. ones which have just returned a result) and collating these results. Global data may be sent to all worker processes at the outset.
Processor farming is only appropriate if:
• The task to be performed can be split up into many INDEPENDENT sections.
• The amount of communication (commands + result) takes significantly less time than the processing itself.
To minimise latency, it may be better to keep two (or maybe three) packets in circulation for each worker - buffers are needed around it.
Load balancing is SEMI-DYNAMIC - the command packets are sent to processors which have just (or are about to) run out of work. Thus all processors are kept busy except for the closedown phase (when some finish before others).
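The controller/worker pattern can be sketched in Python with threads and queues (the squaring "task" is a placeholder for real work; in a real farm each worker would sit on its own processor):

```python
import queue
import threading

def worker(tasks, results):
    """Accept command packets, return results; stop on the None pill."""
    while True:
        packet = tasks.get()
        if packet is None:              # closedown phase
            break
        idx, x = packet
        results.put((idx, x * x))       # stand-in for the real computation

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for w in workers:
    w.start()

njobs = 10
for i in range(njobs):                  # controller: independent work packets
    tasks.put((i, i))
for _ in workers:                       # one closedown pill per worker
    tasks.put(None)
for w in workers:
    w.join()

# Collate: results arrive in whatever order the workers finish, so the
# index carried in each packet restores the original ordering.
collated = dict(results.get() for _ in range(njobs))
print([collated[i] for i in range(njobs)])   # → squares of 0..9
```

A free worker simply takes the next packet from the shared task queue, which is the semi-dynamic load balancing described above.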
A Processor Farm
[Figure: a processor farm - a controller connected via outgoing routers and buffers to several workers, with return routers and buffers carrying results back; each of these sections is placed on a separate processor]
Communication Schemes
Most inter-processor ‘‘communication’’ in a parallel computer can be categorised in one of three ways:
Shared Memory
• A shared memory multiprocessor provides a single memory map - for all operands or just for synchronisation.
• Processes may wait on semaphore variables until other processes are ready to share data.
• Once both processes synchronise, the data is copied and the processes release the semaphore.
Communication Schemes - 2
Synchronous Message Passing
• This typically uses blocking point-to-point channel communications.
• The first process of the pair to arrive at the channel communication instruction stops, and waits until the second arrives at its comms instruction.
• Then, data flows across the channel.
• Only when all of the data has been transmitted do the two processes continue separately again.
Communication Schemes - 3
Asynchronous Message Passing
• An asynchronous transmitter sends its messages without blocking, thereby synchronising very loosely with the channel.
• The receiver necessarily blocks when it tries to read data from an empty channel.
• In effect, this method implements a channel with some degree of buffering.
These three schemes have strong similarities - it is possible to model each in terms of the first two.
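Python's standard queue.Queue behaves very much like the asynchronous channel described here. The sketch below (names my own) shows the transmitter finishing before anything is received, while a receive on an empty channel would block:

```python
import queue
import threading

chan = queue.Queue()                # unbounded buffer: an asynchronous channel

def transmitter():
    for i in range(3):
        chan.put(i)                 # never blocks: very loose synchronisation

t = threading.Thread(target=transmitter)
t.start()
t.join()                            # sender has finished; buffer still holds 3 items

received = [chan.get() for _ in range(3)]   # get() blocks only on an empty channel
print(received)                     # → [0, 1, 2]
```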
Deadlock and Livelock - 1
Consider a program in which two processes attempt to send synchronised messages to each other ... but in which neither process is prepared to receive the other's message. This is a very simple example of a deadlocked system.
Two processes, connected by channels c and d, each commit to an output that the other never accepts:

    P:  c ! 42          Q:  d ! 42

In CSP / occam,
    ch ! value     sends a value along a channel;
    ch ? variable  receives a value from a channel and stores it in a variable.
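The same committed-output deadlock can be demonstrated in Python by modelling a blocking CSP-style channel (a sketch of my own; the timeout stands in for "blocked forever" so that the program terminates):

```python
import queue
import threading

class SyncChannel:
    """Rendezvous channel: send blocks until a receiver takes the value."""
    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, value, timeout=None):
        self._data.put(value)
        self._ack.get(timeout=timeout)   # wait for the receiver (may time out)

    def recv(self):
        value = self._data.get()
        self._ack.put(None)              # release the sender
        return value

c, d = SyncChannel(), SyncChannel()
deadlocked = []

def p():                                 # P does c ! 42 and never receives on d
    try:
        c.send(42, timeout=0.2)
    except queue.Empty:
        deadlocked.append("P")

def q():                                 # Q does d ! 42 and never receives on c
    try:
        d.send(42, timeout=0.2)
    except queue.Empty:
        deadlocked.append("Q")

threads = [threading.Thread(target=p), threading.Thread(target=q)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(deadlocked))                # → ['P', 'Q']: both sends blocked
```

Both sends wait for a receiver that never arrives - in a real synchronous system, with no timeout, they would wait forever.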
Deadlock and Livelock - 2
Consider another program in which two processes never communicate on their external channels, but remain busy internally.
    WHILE TRUE               WHILE TRUE
      SEQ                      SEQ
        c ! x                    c ? x
        d ? y                    d ! y
        ...  proc                ...  proc

(here c and d connect the two processes to each other; the subsystem's external channels are never used)

This illustrates livelock. It appears identical to deadlock from the external viewpoint - but additionally consumes processing resources.
Deadlock and Livelock - 3
Deadlock
• is a common design fault;
• its absence requires reasoning and / or verification
- exhaustive testing cannot work
- verification requires us to check all the possible circumstances in which deadlock could occur.
Livelock is less common than true deadlock
• It is often caused by loops failing to terminate
• Absence of livelock again requires verification
Deadlock - Committed Cycles of Communication
Deadlock is possible whenever there is a cycle of processes. If all these processes try to communicate in the same direction around the cycle at the same time, deadlock sets in.
Common scenarios are where:
• all processes in a cycle output to their neighbours at once, all the outputs become committed to completing, and therefore none of the processes will be running input processes to satisfy these communications;
• all processes in a cycle attempt to input from their neighbours at once, and therefore none of the processes will be running output processes to supply these communications.
In full generality, deadlock can occur in any complete cycle of channels, when all the processes at the same end of these channels commit to communicate at once.
A Small-System Example - the Resettable Timer
Here is an example from the embedded-system world - although it is too tightly-coupled for a Grid application.
The environment looks like:
[Figure: a timer process sends on channel tick to a user process; the user process sends on channel reset back to the timer; a button channel feeds the user process]
The user process is not parallel - it accepts ticks from the timer and button-press stimuli from the button. Clock ticks provoke some action, whilst button presses cause the timer to be sent a message telling it to change its output rate.
Everything seems OK??
• The clock ticks and the user process deals with it.
• The button is pressed, and the new rate is sent down the reset channel. The timer process receives the new rate and the value of 'gap' is set.
Small System Example - what happens?
The occam for this user process might look like this:

    WHILE TRUE
      ALT
        tick ? value
          ...  deal with this time interval
        button ? any
          SEQ
            ...  calculate new tick rate
            reset ! rate
Nightmare Scenario
The clock ticks at almost exactly the same time as the button is pressed;
The ALTs in both processes activate;
The timer process tries to send a message on channel tick, and the user process calculates the new clock rate and then tries to send a message on the reset channel.
Both processes are committed to sending, and neither is listening, so both communications block - DEADLOCK.
Solution by Buffering
[Figure: as before, but with one or more buffer processes inserted into the reset channel between the user and timer processes]
Adding a buffer in the reset channel may appear to offer a solution - the user process may now output a reset command, even if the timer is sending a tick, because it will be stored in the buffer until the ALT accepts the tick, upon which the timer can service the reset channel.
But what if the button key-bounces and the reset channel is sent two values in quick succession? Or 10? We need an infinite number of buffers to guarantee that deadlock cannot occur.
Use an OVERWRITING buffer instead. That will absorb any number of reset values, eventually passing on the last one (which even saves the timer process from dealing with all the earlier ones). This is DEADLOCK FREE.
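The overwriting buffer is easy to sketch in Python (a condition-variable version of my own, not the occam presented later): the writer never blocks, so it cannot take part in a committed cycle, and the reader always sees the latest value.

```python
import threading

class OverwritingBuffer:
    """One-place buffer whose writer never blocks:
    a new value simply overwrites an unread one."""
    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._full = False

    def put(self, value):               # never blocks -> cannot deadlock the writer
        with self._cond:
            self._value = value         # overwrite any unread value
            self._full = True
            self._cond.notify()

    def get(self):                      # blocks only while the buffer is empty
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            return self._value

buf = OverwritingBuffer()
for rate in (100, 50, 20):              # three resets in quick succession
    buf.put(rate)
latest = buf.get()
print(latest)                           # → 20: only the last reset survives
```

The key-bounce scenario above is absorbed completely: however many resets arrive, the timer eventually sees just the final rate.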
Deadlock Avoidance - Reasoning about Client-Server Processes
A network of processes comprises a number of nodes connected by channels. If each node behaves as a client to some processes to which it is connected, and / or as a server to other processes, then we can derive conditions under which the network is deadlock-free.
Consider:
[Figure: Process 3 acts as a client to Process 2; Process 2 has a server side facing Process 3 and a client side facing Process 1, which acts as a server]
Here, Process 3 initiates client-server requests to Process 2 and waits until Process 2 responds. In the course of serving some (or all) of Process 3's requests, Process 2 may act as a client by requesting service from Process 1.
Appropriate definitions for the client-server relationship are:
• A client-server transaction starts when the client initiates the sequence - usually by transmitting (although a receive event is also a suitable signal, as it happens, since a communication is a synchronising event).
• When the initiation is complete, the client and server may communicate between themselves in any pre-defined sequence we choose to program. We cannot provoke deadlock if both ends agree on the communication pattern.
Deadlock Avoidance - Reasoning about Client-Server Processes 2
• Eventually, the sequence must come to a close, often with a final server-to-client response, but this last message is optional if both ends otherwise know the sequence is complete.
We may now define the circumstances for which client-server relationships are deadlock-free:
• The server must always accept an initial request from the client in finite time - otherwise the client will be blocked indefinitely and will see the server as deadlocked.
• Client processes may not communicate whilst they are engaged in a transaction - although servers may, but only as a client to another server (e.g. to obtain information which is to be returned to the original client).
Thus, cycles of processes which are committed to communication cannot occur, because clients may never serve a server which they are indirectly in the process of interrogating.
Deadlock Avoidance - Reasoning about Client-Server Processes 3
This is an invalid client-server cycle.
[Figure: an invalid client-server cycle between Client, Server 1 and Server 2 - Server 1 acts as a server to Client and as a client to Server 2; Server 2 is not allowed to act as a client to Client while Client's transaction with Server 1 is in progress]
A Pipeline as a Client-Server Scheme
Notice that neighbouring pipeline elements with unidirectional communications channels satisfy the above discussion on deadlock-freedom - provided we don't join the ends.
[Figure: a linear chain of pipeline processes; joining the two ends into a cycle would provoke deadlock]
Client-Server behaviour of an Over-Writing Buffer
Consider a classic overwriting buffer arrangement:
[Figure: a Memory.Cell process followed by a Prompter process - the upstream producer is a client of Memory.Cell, Memory.Cell is a server on both of its sides, and the Prompter acts as a client of Memory.Cell]
The memory cell acts as a server to both sides - in alternating fashion - and therefore breaks the cycle, even in a chain of pipeline processes.
    PAR
      INT val, any:
      WHILE TRUE                -- Memory.Cell
        ALT
          in ? val
            SKIP
          req ? any
            resp ! val
      INT any, val:
      WHILE TRUE                -- Prompter
        SEQ
          req ! any
          resp ? val
          out ! val
Client-Server behaviour of an Over-Writing Buffer - 2
... is deadlock-free
Deadlock Avoidance - I/O-PAR and I/O-SEQ
A process is ‘‘IO-PAR’’ if it performs communications on all of its connected channels in parallel. Thus, it has the structure

    WHILE running
      SEQ
        ...  parallel I/O (once on all channels)
        ...  compute

A system made up entirely of IO-PAR processes is deadlock-free, even if it contains cycles, and is itself IO-PAR.
I/O-SEQ
A process is ‘‘IO-SEQ’’ if it performs communications on all of its connected input channels in parallel, and then on all of its connected output channels (again, in parallel). For example:

    WHILE running
      SEQ
        ...  parallel inputs  -- (once on all input channels)
        ...  compute
        ...  parallel outputs -- (once on all output channels)
        ...  compute

A system made up entirely of IO-SEQ processes is deadlock-free, provided that it contains NO cycles, and is itself IO-SEQ.
Combinations of IO-PAR and IO-SEQ Processes
A network made up from a combination of IO-PAR and IO-SEQ components is IO-PAR and will never deadlock, PROVIDED:
(i) that there is no sub-cycle consisting of just IO-SEQ elements;
(ii) there is no direct connection from inputs to outputs just via IO-SEQ elements.
Application to Grid Computing
I/O-PAR - by itself - has limited applicability:
• performance can be lost due to huge communication latencies;
• mismatched processors lead to load-balancing problems.
The Client-Server paradigm is good ...
• Processor farms accommodate mismatched processors;
• Inter-processor latencies can be covered with buffering;
• The technique scales well for large systems.