TRANSCRIPT
University of Surrey
Department of Computing
Module CSM23 - Grid Computing
Lectures in Parallel Computing
Dr. Roger M.A. Peel
R.Peel@surrey.ac.uk
http://www.computing.surrey.ac.uk/personal/st/R.Peel/csm23/index.html
- parallel - 1 - © RMAP 2004
Introduction
In my two lectures, I hope to introduce some of the important issues of any parallel computing activity - be it high-performance computing, multiprocessor embedded systems or Grid computing:
• Why use Parallel Computing?
• Definitions and Limitations of Parallelism
• Major Classifications of Parallelism
• Communication Schemes
• Deadlock and Livelock
• Applications and Examples
Why use Parallel Computing?
There are several reasons why we might choose to use parallel processing:
• Performance;
• Economy;
• Ease of programming.
This parallelism might come from:
• A pipelined or superscalar processor;
• A vector processor;
• A single-box multiprocessor;
• Multiple processors connected with communication lines [distributed processing].
Definitions and Limitations of Parallelism
The performance of a distributed parallel processor is influenced by:
• The number and throughput of the processing nodes.
• The bandwidth between the processing nodes. This is measured in Mbytes/sec (or Gbytes/sec) and determines the rate at which data can be sent from one node to another. It may vary for different pairs of nodes in a system.
• The latency in the node-to-node connections measures their delays in transmission - influenced by the bit rate, the time taken to route data, and the transmission protocol used. This is measured in seconds.
• The Flow Control strategy used influences performance - nodes must be prevented from flooding each other, or the routing fabric, with traffic.
• The Speedup factor for a parallel computer can be defined as

    S(n) = (execution time using one processor) / (execution time using a multiprocessor with n processors)
Definitions and Limitations of Parallelism - Amdahl’s Law
Any parallel code will also have sequential elements - at startup / shutdown, at the beginning and end of each loop, and so on. Consider how much processing is done sequentially and in parallel.
Gene Amdahl suggested the following law, relating to vector processors, but it is equally appropriate to VLIW machines, MIMD multiprocessors, networks of Grid computers and so on.
If the fraction of code in an application that cannot be parallelised is f, and the time taken for the whole computation on one processor is t, the time taken to perform the computation with n processors is given by

    ft + (1 − f)t/n

and the speedup factor is

    S(n) = t / (ft + (1 − f)t/n) = n / (1 + (n − 1)f)

This ignores the overhead that is usually added by parallelisation, as well as any communication costs between the processors - often a significant extra cost.
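To make the formula concrete, here is a small Python sketch (Python rather than the occam used later in these slides; the function name is my own) that evaluates S(n) and the limiting speedup 1/f for a few serial fractions:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: S(n) = n / (1 + (n - 1) * f),
    where f is the fraction of the code that cannot be parallelised."""
    return n / (1 + (n - 1) * f)

for f in (0.05, 0.10, 0.20):
    # Even with 16 processors the speedup is already well below n,
    # and no number of processors can beat the 1/f ceiling.
    print(f"f = {f:.2f}: S(16) = {amdahl_speedup(f, 16):.2f}, "
          f"limit 1/f = {1 / f:.1f}")
```

With f = 0.05, sixteen processors give a speedup of only about 9, and no machine, however large, can exceed 20.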
Definitions and Limitations of Parallelism - Amdahl’s Law
[Figure: Speedup vs No. of Processors - S(n) plotted for 0 to 20 processors, with one curve for each of f = 0%, 5%, 10% and 20%]
Lessons from Amdahl’s Law
From the above, we can see:
• Even for a huge number of processors (or pipelined vector elements), the maximum speedup is 1/f.
• Small reductions in the sequential overhead can make a huge difference in throughput.
• Reducing the serial & communications overhead (e.g. by running communications and computation in parallel) is very beneficial.
Major Classifications of Parallelism
There are only three major classes of parallel processing:
Algorithmic
we split the algorithm into sections (e.g. pipelining)
Geometric
we split the static data space into sections (e.g. process an image on an array of processors)
Processor Farming
we pass the input data to many processors (e.g. pass ray-tracing coordinates to several processors, one ray at a time)
Some parallel applications might combine these techniques.
But first - Load Balancing
There are three forms of load balancing:
• Static load balancing
The choice of which processor to use for each part of the task is made at compile time. This is inflexible, but simple.
• Semi-dynamic load balancing
The choice of processor is made at run-time, but once started, each task must run to completion on the processor chosen. This is more efficient, but not always possible.
• Fully-dynamic load balancing
Here, tasks may be interrupted and moved between processors at will. This will enable processors with different capabilities always to be used to best advantage (e.g. FPU). The context-switching and communication cost of interrupting and moving processes (and their data) may outweigh the gains of optimal scheduling.
Algorithmic Parallelism
Many tasks can be split so that a stream of data may be processed in successive stages on a series of processors.
As the first stage finishes its processing, its result is passed to the second stage. The first stage may now accept more input data and process it. When the second stage finishes, it passes its result on, accepts the result from the first stage, and so on. Thus the first task to be submitted passes out of the pipeline in n cycles, where n is the number of processing stages in the pipeline.
When the pipeline is full, one result is produced at every cycle.
At the end of continuous operation, the early stages in the pipeline go idle before the last result is flushed out.
Load balancing is STATIC - the speed of the pipeline is determined by the speed of the slowest stage.
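The behaviour described above can be sketched in Python (a stand-in for the occam of these slides), with one thread per stage and a queue as the channel between neighbours; the three stage functions are arbitrary placeholders:

```python
import queue
import threading

def stage(fn, inq, outq):
    """One pipeline stage: read an item, process it, pass the result on.
    A None marker flushes the pipeline and stops the stage."""
    while True:
        item = inq.get()
        if item is None:
            outq.put(None)          # propagate the flush marker
            break
        outq.put(fn(item))

# Three-stage linear pipeline: q0 -> stage -> q1 -> stage -> q2 -> stage -> q3
q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
workers = [threading.Thread(target=stage, args=(f, i, o))
           for f, i, o in ((lambda x: x + 1, q0, q1),
                           (lambda x: x * 2, q1, q2),
                           (lambda x: x - 3, q2, q3))]
for w in workers:
    w.start()

for x in range(5):                  # stream data into the first stage
    q0.put(x)
q0.put(None)                        # end of input: flush the pipeline

out = []
while (r := q3.get()) is not None:
    out.append(r)
print(out)                          # → [-1, 1, 3, 5, 7]
```

Once the pipeline is full, all three stages work concurrently; the slowest stage sets the overall rate, exactly as the static load-balancing remark above says.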
Algorithmic Parallelism
[Figure: three pipeline topologies, each with data flowing in and results flowing out - a linear pipeline (or chain), a pipeline with a parallel section, and an irregular network (the general case)]
Geometric Parallelism
Some regular-patterned tasks are suitable for processing by spreading their data across several processors and performing the same task on each section in parallel.
Many examples involve image processing - spread the pixels of an image across an array of transputers and perform local averaging, contrast enhancement etc. on each portion of the image.
Many such tasks involve communication of boundary data from one portion of the image to another (e.g. Conway's Game of Life). It may be possible to duplicate this edge data.
Load balancing is STATIC - the initial partitioning of the data determines the execution time to process each area. Rectangular blocks may not be the best choice - consider stripes, concentric squares, etc.
Initial loading of the data may prove to be a serious overhead.
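A minimal Python sketch of the boundary-duplication idea (the 1-D "image", stripe sizes and smoothing function are all invented for illustration): each stripe carries a copy of one neighbouring element on each side, so a local-averaging step needs no communication.

```python
grid = list(range(12))                  # toy 1-D "image" of 12 pixels
nproc = 3
chunk = len(grid) // nproc

def smooth(seg):
    """Average each interior element with its two neighbours."""
    return [(seg[i - 1] + seg[i] + seg[i + 1]) / 3
            for i in range(1, len(seg) - 1)]

padded = [grid[0]] + grid + [grid[-1]]  # replicate the global edges
# Each stripe = its own chunk plus one duplicated edge element per side.
stripes = [padded[p * chunk : p * chunk + chunk + 2] for p in range(nproc)]

parallel = [v for s in stripes for v in smooth(s)]  # each stripe independently
assert parallel == smooth(padded)       # identical to the one-processor answer
print(len(parallel))                    # 12 smoothed pixels
```

The assertion confirms that processing the stripes independently, each with its duplicated edge data, reproduces the single-processor result.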
Geometric Parallelism
(Data is distributed amongst the available processors)
Data Array Connected processors
Processor Farming
The third major form of parallelism involves sharing work out from a central controller process to several worker processes. The latter just accept packets of command data and return results. The controller splits up the task, sending work packets to free processors (i.e. ones which have just returned a result) and collating these results. Global data may be sent to all worker processes at the outset.
Processor farming is only appropriate if:
• The task to be performed can be split up into many INDEPENDENT sections.
• The amount of communication (commands + result) takes significantly less time than the processing itself.
To minimise latency, it may be better to keep two (or maybe three) packets in circulation for each worker - buffers are needed around it.
Load balancing is SEMI-DYNAMIC - the command packets are sent to processors which have just (or are about to) run out of work. Thus all processors are kept busy except for the closedown phase (when some finish before others).
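The controller/worker pattern can be sketched in Python with threads and queues (the squaring "task" is a placeholder for real work; in a real farm each worker would sit on its own processor):

```python
import queue
import threading

def worker(tasks, results):
    """Accept command packets, return results; stop on the None pill."""
    while True:
        packet = tasks.get()
        if packet is None:              # closedown phase
            break
        idx, x = packet
        results.put((idx, x * x))       # stand-in for the real computation

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for w in workers:
    w.start()

njobs = 10
for i in range(njobs):                  # controller: independent work packets
    tasks.put((i, i))
for _ in workers:                       # one closedown pill per worker
    tasks.put(None)
for w in workers:
    w.join()

# Collate: results arrive in whatever order the workers finish, so the
# index carried in each packet restores the original ordering.
collated = dict(results.get() for _ in range(njobs))
print([collated[i] for i in range(njobs)])   # → squares of 0..9
```

A free worker simply takes the next packet from the shared task queue, which is the semi-dynamic load balancing described above.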
A Processor Farm
[Figure: a processor farm - a controller connected via outgoing routers and buffers to several workers, with return routers and buffers carrying results back; each of these sections is placed on a separate processor]
Communication Schemes
Most inter-processor ‘‘communication’’ in a parallel computer can be categorised in one of three ways:
Shared Memory
• A shared memory multiprocessor provides a single memory map - for all operands or just for synchronisation.
• Processes may wait on semaphore variables until other processes are ready to share data.
• Once both processes synchronise, the data is copied and the processes release the semaphore.
Communication Schemes - 2
Synchronous Message Passing
• This typically uses blocking point-to-point channel communications.
• The first process of the pair to arrive at the channel communication instruction stops, and waits until the second arrives at its comms instruction.
• Then, data flows across the channel.
• Only when all of the data has been transmitted do the two processes continue separately again.
Communication Schemes - 3
Asynchronous Message Passing
• An asynchronous transmitter sends its messages without blocking, thereby synchronising very loosely with the channel.
• The receiver necessarily blocks when it tries to read data from an empty channel.
• In effect, this method implements a channel with some degree of buffering.
These three schemes have strong similarities - it is possible to model each in terms of the first two.
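Python's standard queue.Queue behaves very much like the asynchronous channel described here. The sketch below (names my own) shows the transmitter finishing before anything is received, while a receive on an empty channel would block:

```python
import queue
import threading

chan = queue.Queue()                # unbounded buffer: an asynchronous channel

def transmitter():
    for i in range(3):
        chan.put(i)                 # never blocks: very loose synchronisation

t = threading.Thread(target=transmitter)
t.start()
t.join()                            # sender has finished; buffer still holds 3 items

received = [chan.get() for _ in range(3)]   # get() blocks only on an empty channel
print(received)                     # → [0, 1, 2]
```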
Deadlock and Livelock - 1
Consider a program in which two processes attempt to send synchronised messages to each other ... but in which neither process is prepared to receive the other's message. This is a very simple example of a deadlocked system.
Two processes, connected by channels c and d, each commit to an output that the other never accepts:

    P:  c ! 42          Q:  d ! 42

In CSP / occam,
    ch ! value     sends a value along a channel;
    ch ? variable  receives a value from a channel and stores it in a variable.
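The same committed-output deadlock can be demonstrated in Python by modelling a blocking CSP-style channel (a sketch of my own; the timeout stands in for "blocked forever" so that the program terminates):

```python
import queue
import threading

class SyncChannel:
    """Rendezvous channel: send blocks until a receiver takes the value."""
    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, value, timeout=None):
        self._data.put(value)
        self._ack.get(timeout=timeout)   # wait for the receiver (may time out)

    def recv(self):
        value = self._data.get()
        self._ack.put(None)              # release the sender
        return value

c, d = SyncChannel(), SyncChannel()
deadlocked = []

def p():                                 # P does c ! 42 and never receives on d
    try:
        c.send(42, timeout=0.2)
    except queue.Empty:
        deadlocked.append("P")

def q():                                 # Q does d ! 42 and never receives on c
    try:
        d.send(42, timeout=0.2)
    except queue.Empty:
        deadlocked.append("Q")

threads = [threading.Thread(target=p), threading.Thread(target=q)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(deadlocked))                # → ['P', 'Q']: both sends blocked
```

Both sends wait for a receiver that never arrives - in a real synchronous system, with no timeout, they would wait forever.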
Deadlock and Livelock - 2
Consider another program in which two processes never communicate on their external channels, but remain busy internally.
    WHILE TRUE               WHILE TRUE
      SEQ                      SEQ
        c ! x                    c ? x
        d ? y                    d ! y
        ...  proc                ...  proc

(here c and d connect the two processes to each other; the subsystem's external channels are never used)

This illustrates livelock. It appears identical to deadlock from the external viewpoint - but additionally consumes processing resources.
Deadlock and Livelock - 3
Deadlock
• is a common design fault;
• its absence requires reasoning and / or verification
- exhaustive testing cannot work
- verification requires us to check all the possible circumstances in which deadlock could occur.
Livelock is less common than true deadlock
• It is often caused by loops failing to terminate
• Absence of livelock again requires verification
Deadlock - Committed Cycles of Communication
Deadlock is possible whenever there is a cycle of processes. If all these processes try to communicate in the same direction around the cycle at the same time, deadlock sets in.
Common scenarios are where:
• all processes in a cycle output to their neighbours at once, all the outputs become committed to completing, and therefore none of the processes will be running input processes to satisfy these communications;
• all processes in a cycle attempt to input from their neighbours at once, and therefore none of the processes will be running output processes to supply these communications.
In full generality, deadlock can occur in any complete cycle of channels, when all the processes at the same end of these channels commit to communicate at once.
A Small-System Example - the Resettable Timer
Here is an example from the embedded-system world - although it is too tightly-coupled for a Grid application.
The environment looks like:
[Figure: a timer process sends on channel tick to a user process; the user process sends on channel reset back to the timer; a button channel feeds the user process]
The user process is not parallel - it accepts ticks from the timer and button-press stimuli from the button. Clock ticks provoke some action, whilst button presses cause the timer to be sent a message telling it to change its output rate.
Everything seems OK??
• The clock ticks and the user process deals with it.
• The button is pressed, and the new rate is sent down the reset channel. The timer process receives the new rate and the value of 'gap' is set.
Small System Example - what happens?
The occam for this user process might look like this:

    WHILE TRUE
      ALT
        tick ? value
          ...  deal with this time interval
        button ? any
          SEQ
            ...  calculate new tick rate
            reset ! rate
Nightmare Scenario
The clock ticks at almost exactly the same time as the button is pressed;
The ALTs in both processes activate;
The timer process tries to send a message on channel tick, and the user process calculates the new clock rate and then tries to send a message on the reset channel.
Both processes are committed to sending, and neither is listening, so both communications block - DEADLOCK.
Solution by Buffering
[Figure: as before, but with one or more buffer processes inserted into the reset channel between the user and timer processes]
Adding a buffer in the reset channel may appear to offer a solution - the user process may now output a reset command, even if the timer is sending a tick, because it will be stored in the buffer until the ALT accepts the tick, upon which the timer can service the reset channel.
But what if the button key-bounces and the reset channel is sent two values in quick succession? Or 10? We need an infinite number of buffers to guarantee that deadlock cannot occur.
Use an OVERWRITING buffer instead. That will absorb any number of reset values, eventually passing on the last one (which even saves the timer process from dealing with all the earlier ones). This is DEADLOCK FREE.
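The overwriting buffer is easy to sketch in Python (a condition-variable version of my own, not the occam presented later): the writer never blocks, so it cannot take part in a committed cycle, and the reader always sees the latest value.

```python
import threading

class OverwritingBuffer:
    """One-place buffer whose writer never blocks:
    a new value simply overwrites an unread one."""
    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._full = False

    def put(self, value):               # never blocks -> cannot deadlock the writer
        with self._cond:
            self._value = value         # overwrite any unread value
            self._full = True
            self._cond.notify()

    def get(self):                      # blocks only while the buffer is empty
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            return self._value

buf = OverwritingBuffer()
for rate in (100, 50, 20):              # three resets in quick succession
    buf.put(rate)
latest = buf.get()
print(latest)                           # → 20: only the last reset survives
```

The key-bounce scenario above is absorbed completely: however many resets arrive, the timer eventually sees just the final rate.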
Deadlock Avoidance - Reasoning about Client-Server Processes
A network of processes comprises a number of nodes connected by channels. If each node behaves as a client to some processes to which it is connected, and / or as a server to other processes, then we can derive conditions under which the network is deadlock-free.
Consider:
[Figure: Process 3 acts as a client to Process 2; Process 2 has a server side facing Process 3 and a client side facing Process 1, which acts as a server]
Here, Process 3 initiates client-server requests to Process 2 and waits until Process 2 responds. In the course of serving some (or all) of Process 3's requests, Process 2 may act as a client by requesting service from Process 1.
Appropriate definitions for the client-server relationship are:
• A client-server transaction starts when the client initiates the sequence - usually by transmitting (although a receive event is also a suitable signal, as it happens, since a communication is a synchronising event).
• When the initiation is complete, the client and server may communicate between themselves in any pre-defined sequence we choose to program. We cannot provoke deadlock if both ends agree on the communication pattern.
Deadlock Avoidance - Reasoning about Client-Server Processes 2
• Eventually, the sequence must come to a close, often with a final server-to-client response, but this last message is optional if both ends otherwise know the sequence is complete.
We may now define the circumstances for which client-server relationships are deadlock-free:
• The server must always accept an initial request from the client in finite time - otherwise the client will be blocked indefinitely and will see the server as deadlocked.
• Client processes may not communicate whilst they are engaged in a transaction - although servers may, but only as a client to another server (e.g. to obtain information which is to be returned to the original client).
Thus, cycles of processes which are committed to communication cannot occur, because clients may never serve a server which they are indirectly in the process of interrogating.
Deadlock Avoidance - Reasoning about Client-Server Processes 3
This is an invalid client-server cycle.
[Figure: an invalid client-server cycle between Client, Server 1 and Server 2 - Server 1 acts as a server to Client and as a client to Server 2; Server 2 is not allowed to act as a client to Client while Client's transaction with Server 1 is in progress]
A Pipeline as a Client-Server Scheme
Notice that neighbouring pipeline elements with unidirectional communications channels satisfy the above discussion on deadlock-freedom - provided we don't join the ends.
[Figure: a linear chain of pipeline processes; joining the two ends into a cycle would provoke deadlock]
Client-Server behaviour of an Over-Writing Buffer
Consider a classic overwriting buffer arrangement:
[Figure: a Memory.Cell process followed by a Prompter process - the upstream producer is a client of Memory.Cell, Memory.Cell is a server on both of its sides, and the Prompter acts as a client of Memory.Cell]
The memory cell acts as a server to both sides - in alternating fashion - and therefore breaks the cycle, even in a chain of pipeline processes.
    PAR
      INT val, any:
      WHILE TRUE                -- Memory.Cell
        ALT
          in ? val
            SKIP
          req ? any
            resp ! val
      INT any, val:
      WHILE TRUE                -- Prompter
        SEQ
          req ! any
          resp ? val
          out ! val
Client-Server behaviour of an Over-Writing Buffer - 2
... is deadlock-free
Deadlock Avoidance - I/O-PAR and I/O-SEQ
A process is ‘‘IO-PAR’’ if it performs communications on all of its connected channels in parallel. Thus, it has the structure

    WHILE running
      SEQ
        ...  parallel I/O (once on all channels)
        ...  compute

A system made up entirely of IO-PAR processes is deadlock-free, even if it contains cycles, and is itself IO-PAR.
I/O-SEQ
A process is ‘‘IO-SEQ’’ if it performs communications on all of its connected input channels in parallel, and then on all of its connected output channels (again, in parallel). For example:

    WHILE running
      SEQ
        ...  parallel inputs  -- (once on all input channels)
        ...  compute
        ...  parallel outputs -- (once on all output channels)
        ...  compute

A system made up entirely of IO-SEQ processes is deadlock-free, provided that it contains NO cycles, and is itself IO-SEQ.
Combinations of IO-PAR and IO-SEQ Processes
A network made up from a combination of IO-PAR and IO-SEQ components is IO-PAR and will never deadlock, PROVIDED:
(i) that there is no sub-cycle consisting of just IO-SEQ elements;
(ii) there is no direct connection from inputs to outputs just via IO-SEQ elements.
Application to Grid Computing
I/O-PAR - by itself - has limited applicability:
• performance can be lost due to huge communication latencies;
• mismatched processors lead to load-balancing problems.
The Client-Server paradigm is good ...
• Processor farms accommodate mismatched processors;
• Inter-processor latencies can be covered with buffering;
• The technique scales well for large systems.