
Page 1: Characterization of communication.ppt

Characterization of Communication in Distributed Memory Multiprocessors

Harry F. Jordan and Gita Alaghband, "Fundamentals of Parallel Processing"

Page 2: Characterization of communication.ppt

1. Point-to-Point communication

Characteristics of the transmission of data: initiation, synchronization, binding, buffering and latency control.

o Initiation: by the sender; initiation by a receiver request is also possible when a single programmer writes both ends

o Synchronization: blocking/non-blocking send and receive

o Binding: designating sender and receiver by process id; by a message channel; or by a (receiver process, tag) pair, where the tag identifies the message's role in the computation (content-based, associative addressing)

o Buffering: finite capacity limits the number of messages sent but not yet received

o Latency: the time from execution of the send operation until the message data arrives at the receiver

The delivery time for a message of length L bytes is roughly characterized by

  T = T_S + L * T_B          (1)

where T_S is the start-up time and T_B is the time per byte. When T_S >> T_B, very long messages are sent in shorter parts.
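As a rough illustration of equation (1), with numbers assumed for the example rather than taken from the text: if T_S = 50 µs and T_B = 0.01 µs per byte, a 1,000-byte message takes about 50 + 1,000 * 0.01 = 60 µs, dominated by start-up, while a 100,000-byte message takes about 50 + 1,000 = 1,050 µs, dominated by the per-byte term.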

Page 3: Characterization of communication.ppt

1. Point-to-Point communication(Cont…)

Summary of Point-to-Point communication characteristics

  Initiation        Sender; receiver request
  Synchronization   Blocking/non-blocking send/receive
  Binding           Type                            Associated operations
                    (source ID, destination ID)     send(destination ID); receive(source ID)
                    Channel number                  open(channel); close(channel); send(channel); receive(channel)
                    (tag, destination ID)           send(tag, destination ID); receive(tag)
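For concreteness, the (tag, destination ID) binding style maps directly onto message-passing libraries such as MPI. The fragment below is a minimal sketch, not part of the slides; the tag value, the ranks and the function name are assumed for the illustration.

#include <mpi.h>

/* Sketch of binding by (tag, destination ID): the tag identifies the
   message's role in the computation (content-based addressing). */
void tag_binding_example(int myid)
{
    const int TAG_ROW_DATA = 7;          /* assumed tag value */
    double value = 3.14, incoming;
    MPI_Status status;

    if (myid == 0) {
        /* send(tag, destination ID) */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, TAG_ROW_DATA, MPI_COMM_WORLD);
    } else if (myid == 1) {
        /* receive(tag): the receiver selects on the tag, from any source */
        MPI_Recv(&incoming, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_ROW_DATA,
                 MPI_COMM_WORLD, &status);
    }
}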

Page 4: Characterization of communication.ppt

1. Point-to-Point communication(Cont…)

Summary of Point-to-Point communication characteristics (continued)

  Buffer
    Type:      one per source for each destination; one per channel;
               one per tag for each destination
    Location:  sender side:   user space, system space, or I/O processor
               receiver side: user space, system space, or I/O processor
    Capacity:  byte limit; message limit

  Latency
    Time parameters: start-up time; time per byte
    Transfer through an additional buffer: adds to the start-up time;
               adds to the time per byte unless pipelined

Page 5: Characterization of communication.ppt

1. Point-to-Point communication(Cont…)

Message latency, buffering and non-blocking operations let communication overlap computation:

o Information should be sent as soon as it is produced, to maximize parallel activity

o If the send is non-blocking, communication and computation can be overlapped, provided that continuing the computation does not depend on receiving the transmitted data (a sketch of this overlap follows below)

o To overlap effectively, the message must be sent as soon as possible and the subsequent computation must be independent of the communication

o If the message in the user's space is copied to a buffer in system space or in an I/O processor, the values used in the send can be recomputed, and the sender can start constructing a new message in the same area immediately
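A minimal MPI sketch of this overlap, under assumptions not in the slides (message length, buffer names, and a hypothetical independent computation): the send is started with a non-blocking call, independent work proceeds, and completion is awaited only when the message area must be reused.

#include <mpi.h>

#define MSG_LEN 1024                         /* assumed message length */

void overlap_example(int myid, int numprocs, double *msg, double *work)
{
    MPI_Request req;
    MPI_Status  status;
    int dest = (myid + 1) % numprocs;        /* assumed destination */

    /* Start the send as early as possible (non-blocking). */
    MPI_Isend(msg, MSG_LEN, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    /* Computation that does not depend on the transmitted data
       overlaps with the message transfer. */
    for (int i = 0; i < MSG_LEN; i++)
        work[i] *= 2.0;                      /* hypothetical independent computation */

    /* Wait only when msg must be reused or overwritten. */
    MPI_Wait(&req, &status);
}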

Page 6: Characterization of communication.ppt

1. Point-to-Point communication(Cont…)

o Each copy of message data between buffers adds time to the producer's send operation, to the message latency, and to the consumer's receive operation

o The choice of buffering depends on the software overhead of send and receive and on the performance characteristics of the interconnection network

o A factor of 2 speed-up is obtained by perfect overlap of communication and computation

o Initiating a new send promptly enables further computation in other processes

o Many simultaneous communications can be overlapped if the processor supports multiple concurrent messages (one way to arrange this is sketched below)
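One common arrangement that avoids waiting on the copy is double buffering in user space; the sketch below is an illustration under assumed names and sizes, not the text's own scheme: the sender alternates between two buffers so that a new message can be constructed while the previous one is still in flight.

#include <mpi.h>

#define LEN 4096                              /* assumed message length */

void double_buffered_sends(int dest, int steps, MPI_Comm comm)
{
    static double buf[2][LEN];
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int s = 0; s < steps; s++) {
        int cur = s % 2;

        /* Reuse a buffer only after its previous send has completed. */
        MPI_Wait(&req[cur], MPI_STATUS_IGNORE);

        for (int i = 0; i < LEN; i++)         /* hypothetical message construction */
            buf[cur][i] = s + i;

        /* Non-blocking send lets the next message be built immediately. */
        MPI_Isend(buf[cur], LEN, MPI_DOUBLE, dest, s, comm, &req[cur]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}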

Page 7: Characterization of communication.ppt

1. Point-to-Point communication(Cont…)

(a) Well-overlapped communication

(b) Poorly overlapped communication

Page 8: Characterization of communication.ppt

2. Variable classes in distributed memory programs

o Variable classes describe the way in which variables are shared among processes

o Each variable resides in some processor's local memory and is private to the process running on that processor

o In SPMD programs, the same variable name may have a representative in each processor, and the representatives may be updated to the same value in all processors

o Parallelism classes of variables: private, unique, cooperative update, replicated or partitioned

Page 9: Characterization of communication.ppt

2. Variable classes(Cont…)

o Private variables: a single name refers to a different memory cell and value in each processor

o Unique simple variables: the variable and its value are defined in only one processor

o Structured variables: individual components are unique to a single processor

o Cooperative update shared variables: a variable with a single value visible to all processors, represented by one cell in some processor's memory; the update is performed cooperatively

  – supported by high-level, and sometimes complex, communication operations

Page 10: Characterization of communication.ppt

2. Variable classes(Cont…)

o Shared variables whose value is made available to many processors by redundant computation (a loop index, for example) are replicated

o Replicated variables: take on the same sequence of values in every processor

o Partitioned variables: components can be distributed when the loop does not combine elements from different rows in any arithmetic operation

o Collective communication is used when a variable of one of the shared classes is updated

o Example: a broadcast allows a single processor to give a new value to a cooperative update variable (a small SPMD sketch of the classes follows below)
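A small SPMD fragment can make the classes concrete. In the sketch below the names, sizes and the block partitioning are assumed for illustration: myid is private (a different value in each process), n is replicated (every process assigns it the same value), and x is partitioned (each process holds and updates only its own block of components).

#include <mpi.h>
#include <stdlib.h>

#define NTOTAL 1024                        /* assumed global problem size */

void variable_classes_example(void)
{
    int myid, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);  /* private: a different value in each process */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = NTOTAL;                        /* replicated: every process assigns the same value */

    int chunk = n / p;                     /* assume p divides n, for simplicity */
    double *x = malloc(chunk * sizeof *x); /* partitioned: only the locally owned block exists here */
    for (int i = 0; i < chunk; i++)
        x[i] = 0.0;                        /* local index i corresponds to global index myid*chunk + i */
    free(x);
}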

Page 11: Characterization of communication.ppt

2. Variable classes(Cont…)

real myC, myA, myB, tmpA, tmpB;
integer i, j, m, k, N;
myC := 0;
for k := 0 step 1 until N-1         /* Loop over inner product terms */
begin
  for m := 0 step 1 until N-1       /* Loop over receivers */
  begin
    if k != m then
    {
      if j = k then P(i, m) ! myA;   /* P(i, k) sends A[i, k]  */
      if j = m then P(i, k) ? tmpA;  /* to P(i, m) for all i   */
      if i = k then P(m, j) ! myB;   /* P(k, j) sends B[k, j]  */
      if i = m then P(k, j) ? tmpB;  /* to P(m, j) for all j   */
    }
  end
  if j = k then tmpA := myA;   /* Copy when sender and receiver would be the same */
  if i = k then tmpB := myB;
  myC := myC + tmpA * tmpB;
end

Distributed memory matrix multiplication using CSP blocking communication

Page 12: Characterization of communication.ppt

2. Variable classes(Cont…)

if i = q and j = k then
  for m := 0 step 1 until N-1
    if m != k then send A to P(i, m);
if i = q and j != k then receive A from P(q, k);

Broadcast from P(q, k) to all P(q, m) for m != k

Page 13: Characterization of communication.ppt

3. High-level Communication Operations

o A broadcast assigns a new value to a cooperative update variable

o Communication operations can combine values from different processes and distribute the result

o Summation, or reduction, combines a value from each process and either passes the sum to a single root process or distributes it to all processes

o A prefix computation across values from different processes returns a different but related value to each process in the group

o Example: a sum prefix that receives the value one from each of p processes returns to each process an integer in the range 0 to p-1 (see the MPI sketch below)
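The reduction and sum-prefix operations described above correspond directly to MPI_Reduce and MPI_Exscan; the fragment below is a minimal sketch with assumed variable names, not code from the text.

#include <mpi.h>

void reduce_and_prefix(MPI_Comm comm)
{
    int one = 1, total = 0, myoffset = 0;

    /* Reduction: combine one value from each process and pass the sum
       to a single root process (rank 0 here). */
    MPI_Reduce(&one, &total, 1, MPI_INT, MPI_SUM, 0, comm);

    /* Exclusive sum prefix: each process receives the sum of the values
       contributed by lower-ranked processes, so contributing 1 returns
       the integers 0 .. p-1, one per process.  (The result on rank 0 is
       undefined by the standard; myoffset simply keeps its initial 0.) */
    MPI_Exscan(&one, &myoffset, 1, MPI_INT, MPI_SUM, comm);
}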

Page 14: Characterization of communication.ppt

3. High-level Communication Operations(Cont…)

o The results might be private variables or components of a partitioned vector

o Remapping a structure into a different partition over processes corresponds to a permutation of the structure's components among processes (as in matrix multiplication)

o Communication operations are characterized by the source of their input and the destination of their output

o A combined operation can be implemented more efficiently than two communications in sequence

o A prefix operation is related to a reduction but produces a different value for each destination process

Page 15: Characterization of communication.ppt

3. High-level Communication Operations(Cont…)

o Scatter: takes a vector of P items (P = number of processes) from one process and distributes it, one item to each process

o The reverse operation, gather, collects an item from each process and concatenates them into a vector result delivered to a single destination process

o Gather/scatter: remaps a partitioned structure; a vector of P source items, one for each destination process, is taken from each process, the collection is reorganized into a vector of items for each destination, and the vectors are delivered to their respective processes (see the sketch below)
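A minimal MPI sketch of scatter and its reverse, gather; the vector contents, the doubling step and the function name are assumed for illustration.

#include <mpi.h>
#include <stdlib.h>

void scatter_then_gather(MPI_Comm comm)
{
    int p, myid;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &myid);

    double *vec = NULL, mine;
    if (myid == 0) {                       /* vector of P items at the single source process */
        vec = malloc(p * sizeof *vec);
        for (int i = 0; i < p; i++)
            vec[i] = (double)i;
    }

    /* Scatter: one item of the root's vector goes to each process. */
    MPI_Scatter(vec, 1, MPI_DOUBLE, &mine, 1, MPI_DOUBLE, 0, comm);

    mine *= 2.0;                           /* hypothetical local work on the item */

    /* Gather (the reverse): collect one item from each process and
       concatenate them into a vector delivered to the root. */
    MPI_Gather(&mine, 1, MPI_DOUBLE, vec, 1, MPI_DOUBLE, 0, comm);

    free(vec);                             /* free(NULL) is a no-op on non-root ranks */
}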

Page 16: Characterization of communication.ppt

3. High-level Communication Operations(Cont…)

Variable class       Update methods
Unique               Assignment by one processor
Private              Parallel assignment of different values by all processes; prefix computation
Cooperative update   Broadcast from a single processor; reduction
Replicated           Parallel assignment of the same value by all processes
Partitioned          Each process assigns to its own components; prefix computation; permutation for remapping

Distributed variable classes and methods of updating them

Page 17: Characterization of communication.ppt

3. High-level Communication Operations(Cont…)

Source                                Destination
One process:   Single item            One process:   Single item
               Multiple items                        Multiple items
All processes: Concatenation          All processes: Single item per process
               Arithmetic combining                  Multiple items per process

Characterizing source and destination of collective communications

Communication    Source                               Destination
Point-to-point   One process                          One process
Broadcast        One process: one item                All processes: item per process
Gather           All processes: item per process      One process: P items
Scatter          One process: P items                 All processes: item per process
Reduce           All processes: arithmetic combining  One process: one item

Communication operations and their source and destination types

Page 18: Characterization of communication.ppt

3. High-level Communication Operations(Cont…)

o There are different choices for the source of the items to be communicated and for the destination of the messages

o A specific language or library of communication functions exhibits a number of variations on each communication operation

o One source of variation is the data type associated with the source or destination

o Vectors of values are supported as sources and as results

o A particular arrangement of data may be used repeatedly in different communications, but latency control can lead to aggregating loosely related or unrelated data into a single communication

Page 19: Characterization of communication.ppt

3. High-level Communication Operations(Cont…)

• Just as data items are merged into an output file when written and decomposed into individual items by file-specific input code when read, the corresponding communication ideas are packing and unpacking a message buffer

• The motivation for packing and unpacking long messages is the message start-up overhead: an irreducible amount of time is taken to start sending a message of any length

• The start-up time, or latency, varies with the system

• Packing: items of different size and type are concatenated into one long message

• Unpacking: the message is separated back into its items at the destination (see the sketch below)
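MPI expresses this idea directly with MPI_Pack and MPI_Unpack; in the sketch below the buffer size and the packed fields are assumed for illustration, and the point is that one send pays the start-up overhead once for several items of different type.

#include <mpi.h>

#define BUFSIZE 64                        /* assumed packing buffer size */

void packed_send(int dest, MPI_Comm comm)
{
    char buf[BUFSIZE];
    int  position = 0;
    int  count = 10;                      /* hypothetical items of different types */
    double scale = 0.5;

    /* Pack items of different size and type into one long message. */
    MPI_Pack(&count, 1, MPI_INT,    buf, BUFSIZE, &position, comm);
    MPI_Pack(&scale, 1, MPI_DOUBLE, buf, BUFSIZE, &position, comm);

    /* One send pays the start-up overhead once for both items. */
    MPI_Send(buf, position, MPI_PACKED, dest, 0, comm);
}

void packed_recv(int source, MPI_Comm comm)
{
    char buf[BUFSIZE];
    int  position = 0, count;
    double scale;
    MPI_Status status;

    MPI_Recv(buf, BUFSIZE, MPI_PACKED, source, 0, comm, &status);

    /* Unpack: separate the message back into its items at the destination. */
    MPI_Unpack(buf, BUFSIZE, &position, &count, 1, MPI_INT,    comm);
    MPI_Unpack(buf, BUFSIZE, &position, &scale, 1, MPI_DOUBLE, comm);
}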

Page 20: Characterization of communication.ppt

3. High-level Communication Operations(Cont…)

Behavior of some collective communication operations

Page 21: Characterization of communication.ppt

4. Distributed Gauss elimination

• The problem is solving a system of linear equations

• The machine has one host processor that does all I/O and P identical worker processors (an MPI-flavoured skeleton of this structure follows below)

• One process runs on each processor, and all worker processes execute the same program

• The machine has a communication library supporting the high-level collective operations broadcast and sum reduce, as well as point-to-point communication

• Communication latency is large compared to the floating point operation time, so long messages are preferred

• The program below stores the 2D matrix in column-major order

• Each worker process has a unique id, 0 ≤ id < p, and the host process id is outside this range
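One way to realize this host/worker structure is sketched below in MPI, under the assumption that rank 0 plays the host and the remaining ranks are the workers; the slides describe a separate host processor, so this mapping and the input values are illustrative only.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int params[2] = { 0, 0 };            /* params[0] = p, params[1] = n */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                     /* host: does all the I/O */
        params[0] = size - 1;            /* number of workers (assumed input) */
        params[1] = 8;                   /* order n of the system (assumed input) */
    }

    /* Host broadcasts p and n to all workers. */
    MPI_Bcast(params, 2, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank != 0) {
        /* Workers: all execute the same program (SPMD); the worker of rank 1
           plays the role of "process zero" in the slides and reports timings. */
        double t = MPI_Wtime();
        /* ... generate columns, call the factorization and solve here ... */
        t = MPI_Wtime() - t;
        if (rank == 1)
            MPI_Send(&t, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (size > 1) {
        double t;
        MPI_Recv(&t, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("time reported by process zero: %f s\n", t);
    }

    MPI_Finalize();
    return 0;
}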

Page 22: Characterization of communication.ppt

4. Distributed Gauss elimination(Cont…)

• Process id communicates in point-to-point mode with processes (id+1) mod p and (id-1) mod p

• Ax = b is solved by first factoring A = LU into lower (L) and upper (U) triangular matrices

• The solution is obtained by solving two recurrence systems: forward substitution Ly = b, followed by backward substitution Ux = y

• Matrix A, replaced in place by L and U, is partitioned cyclically by column over the processes P_r, 0 ≤ r < p

• Mapping of an 8×8 matrix to 3 processors (a helper for this cyclic mapping is sketched below)
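For the cyclic column partition, a tiny helper shows how a global column index maps to its owner and local index; the function names and the usual "column j goes to process j mod p" convention are assumed, since the slides only state that the columns are distributed cyclically.

/* Cyclic column partitioning over p processes:
   global column j is owned by process j mod p and stored there
   as local column j / p. */
static int owner_of_column(int j, int p)  { return j % p; }
static int local_column(int j, int p)     { return j / p; }

/* For an 8x8 matrix on 3 processes this gives
   columns 0,3,6 -> process 0, columns 1,4,7 -> process 1, columns 2,5 -> process 2. */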

Page 23: Characterization of communication.ppt

4. Distributed Gauss elimination(Cont…)

Partition of Gauss elimination problem over processes

Page 24: Characterization of communication.ppt

4. Distributed Gauss elimination(Cont…)

Program structure for distributed Gauss elimination

Page 25: Characterization of communication.ppt

4. Distributed Gauss elimination(Cont…)

Host program:
  Input the number p of processes and the order n of the linear system;
  Broadcast p and n to all workers;
  Receive the time to generate the test case from process zero and print it;
  Receive the time for LU factorization from process zero and print the time and rate;
  Receive the solution time for the test b vector from process zero and print the time and rate;
  Receive the residual from process zero, check it and print it;
End

Host program for distributed Gauss elimination

Page 26: Characterization of communication.ppt

4. Distributed Gauss elimination(Cont…)

Worker program (process number id):
  Receive p and n from the host;
  Compute the number m of matrix columns for this process;
  Generate the m columns of the test matrix, A[i, j] = 1/(i - j + 0.5), that belong to this process id;
  Process zero sends the time for generation to the host;
  Call the PGEFA procedure to factor A into the matrix product LU;
  Process zero sends the time for factorization to the host;
  Process zero computes the test right-hand-side vector, b[i] = i, i = 1, ..., n;
  Call the PGESL procedure to solve the equations, leaving the solution vector in process zero;
  Process zero sends the time for solving to the host;
  Call PGEMUL to compute Ax and leave the result in process zero;
  Process zero computes the residual, ∑|(Ax)[i] - b[i]|, and sends it to the host;
End

Worker program for distributed Gauss elimination

Page 27: Characterization of communication.ppt

4. Distributed Gauss elimination(Cont…)

• The order n^3 work of the factorization is organized by stepping sequentially through the diagonal elements

• The order n^2 work of forward and backward substitution is done sequentially in the absence of vector operations

• The host program is unique: it does the I/O and communicates with the workers by broadcast and point-to-point communication

Page 28: Characterization of communication.ppt

5. Process topology Vs Processor topology

• The process topology is different from the processor topology imposed by the interconnection network, even if one and only one process runs on each processor

• Communication software makes the network topology support messages between arbitrary source/destination pairs by forwarding them from point to point in the network

Let A = [a_ij] and B = [b_ij] be n×n matrices, and compute C = AB. The computational complexity of the sequential algorithm is O(n^3).

Page 29: Characterization of communication.ppt

5. Process topology Vs Processor topology(Cont…)

#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, left, right;
    int buffer[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Ring (process) topology: neighbours are defined by rank arithmetic,
       independent of the physical processor interconnection. */
    right = (myid + 1) % numprocs;
    left  = myid - 1;
    if (left < 0)
        left = numprocs - 1;

    /* Shift the buffer around the ring: send to the left neighbour and
       receive the replacement from the right neighbour in one call. */
    MPI_Sendrecv_replace(buffer, 10, MPI_INT, left, 123, right, 123,
                         MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}

Page 30: Characterization of communication.ppt

5. Process topology Vs Processor topology(Cont…)

if j ≠ 0 then
begin
  for k := 0 step 1 until j-1          /* column j of B is shifted up j positions */
  begin
    send myB to P((i-1) mod N, j);
    receive myB from P((i+1) mod N, j);
  end
end
if i ≠ 0 then
begin
  for k := 0 step 1 until i-1          /* row i of A is shifted left i positions */
  begin
    send myA to P(i, (j-1) mod N);
    receive myA from P(i, (j+1) mod N);
  end
end

Initial distribution using one step left and upward transmissions