
1

Non-Blocking Communications

2

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, ncpus;
    int left_neighbor, right_neighbor;
    int data_received = -1;
    int tag = 101;
    MPI_Status statSend, statRecv;
    MPI_Request reqSend, reqRecv;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    left_neighbor = (my_rank-1 + ncpus)%ncpus;
    right_neighbor = (my_rank+1)%ncpus;

    // comm start
    MPI_Isend(&my_rank, 1, MPI_INT, left_neighbor, tag, MPI_COMM_WORLD, &reqSend);
    MPI_Irecv(&data_received, 1, MPI_INT, right_neighbor, tag, MPI_COMM_WORLD, &reqRecv);

    // maybe do something useful here

    // complete comm
    MPI_Wait(&reqSend, &statSend);
    MPI_Wait(&reqRecv, &statRecv);
    printf("Among %d processes, process %d received from right neighbor: %d\n",
           ncpus, my_rank, data_received);

    // clean up
    MPI_Finalize();
    return 0;
}

Example

mpirun -np 4 test_shift

Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2
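The wrap-around neighbor arithmetic used in the program can be checked without MPI. The sketch below is only an illustration: `left_of`, `right_of` and `expected_received` are hypothetical helper names, not part of MPI or of the original program. It assumes the same ring as the example, where every process sends its own rank to the left and receives from the right.

```c
#include <assert.h>

/* Hypothetical helpers (not MPI calls): ring neighbors of `rank`
   among `n` processes, using the same formulas as the example. */
int left_of(int rank, int n)
{
    assert(n > 0 && rank >= 0 && rank < n);
    return (rank - 1 + n) % n;   /* adding n keeps the result non-negative for rank 0 */
}

int right_of(int rank, int n)
{
    assert(n > 0 && rank >= 0 && rank < n);
    return (rank + 1) % n;
}

/* Each process sends its own rank to the left and receives from the
   right, so the value it receives is the rank of its right neighbor. */
int expected_received(int rank, int n)
{
    return right_of(rank, n);
}
```

With n = 4, `expected_received` gives 1, 2, 3, 0 for ranks 0 to 3, which agrees with the four output lines above (they appear in arbitrary order because the processes print independently).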

3

Semantics etc

Purpose:
  Mechanism for overlapping communication and useful computation.
    Communication and computation may proceed concurrently.
    Latency hiding.
  Deadlock avoidance.
  May avoid system buffering and memory-to-memory copying, and so improve performance.

Structure of non-blocking calls:
  Post the communication request with a non-blocking call: MPI_Isend, …
  … // do some useful work
  Complete the communication: MPI_Wait, MPI_Test, …

4

Semantics etc

Non-blocking calls: MPI_Isend, MPI_Irecv etc.
  Return immediately. They merely post a request to the system to initiate the communication.
  However, the communication is not yet complete when the call returns.
  The memory provided in these calls must not be tampered with until the communication is completed by calling MPI_Wait, MPI_Test etc.

[Figure panels: Non-blocking send; Non-blocking receive]

5

Non-blocking Send/Recv

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISEND(BUF,COUNT,DATATYPE,DEST,TAG,COMM,REQUEST,IERROR)
    <type> BUF(*)
    INTEGER COUNT,DATATYPE,DEST,TAG,COMM,REQUEST,IERROR

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRECV(BUF,COUNT,DATATYPE,SOURCE,TAG,COMM,REQUEST,IERROR)
    <type> BUF(*)
    INTEGER COUNT,DATATYPE,SOURCE,TAG,COMM,REQUEST,IERROR

Post send/recv requests to the MPI system. The calls return immediately, but do not access the memory pointed to by buf until the communication has completed.
MPI_Request request is a handle to an internal MPI object. Everything about that non-blocking communication goes through that handle.
MPI_REQUEST_NULL is the null request handle.

MPI_Request req1, req2;
double A[10], B[5];
…
MPI_Isend(A, 10, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req1);
MPI_Irecv(B, 5, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req2);

6

Other Non-blocking Sends

4 communication modes, same semantics as blocking sends.
  MPI_ISEND – standard mode
  MPI_IBSEND – buffered mode
  MPI_ISSEND – synchronous mode
  MPI_IRSEND – ready mode

Arguments are identical to those of MPI_Isend.

int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

7

Completion

Use MPI_Wait or MPI_Test to complete non-blocking communication

Semantics: after MPI_Wait returns
  For a standard send, the message data has been safely stored away; it is safe to access the buffer.
  For a receive, the data has been received.

8

MPI_Wait

Blocks until the communication completes (or fails).
If the request is from MPI_Isend, MPI_Irecv etc.:
  Deallocates the request object and sets request to MPI_REQUEST_NULL.
Returns the status information in status.
  For MPI_Irecv, it holds additional information (such as the actual source and tag).
  For MPI_Isend, it holds little of use.

int MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_WAIT(REQUEST,STATUS,IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

*request is a handle returned from MPI_Isend, MPI_Irecv etc.

MPI_Request req;
MPI_Status stat;
…
MPI_Irecv(…, &req);
MPI_Wait(&req, &stat);

9

MPI_Test

request – MPI_Request object from MPI_Isend, etc.
flag – true if the communication is complete; false if not yet.
  If true, the request object is de-allocated and set to MPI_REQUEST_NULL.
status – contains status information if complete.

Does not block; returns immediately.
Provides a mechanism for overlapping communication and computation:
  Do useful computation; periodically check the communication status; if not complete, go back to computation.

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_TEST(REQUEST,FLAG,STATUS,IERROR)
    LOGICAL FLAG
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

10

Properties

Order: non-overtaking. Message order is preserved according to the execution order of the non-blocking calls that initiate the communications.

Progress: guaranteed.
  A receive completed by MPI_Wait will eventually return if there is a matching send.
  A send completed by MPI_Wait will eventually return if there is a matching receive.

MPI_Comm_rank(comm,&rank);
if(rank==0) {
    MPI_Isend(A,1,MPI_DOUBLE,1,99,comm,&req1);
    MPI_Isend(B,1,MPI_DOUBLE,1,99,comm,&req2);
}
else if(rank==1) {
    MPI_Irecv(A,1,MPI_DOUBLE,0,MPI_ANY_TAG,comm,&req1);
    MPI_Irecv(B,1,MPI_DOUBLE,0,99,comm,&req2);
}
MPI_Wait(&req1,&stat1);
MPI_Wait(&req2,&stat2);

11

MPI_Wait Variants

Deal with arrays of MPI_Requests: MPI_Request req[4];

MPI_Waitall:
  MPI_Waitall(int count, MPI_Request *request, MPI_Status *status)
  Blocks until all active requests in the array complete; returns the status of all communications.
  Deallocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Waitany:
  MPI_Waitany(int count, MPI_Request *req, int *index, MPI_Status *stat)
  Blocks until one of the active requests in the array completes; returns its index in the array and the status of the completing request; deallocates that request object.
  If the array contains no active requests, returns immediately with index=MPI_UNDEFINED.

MPI_Waitsome:
  MPI_Waitsome(int incount, MPI_Request *req, int *outcount, int *array_indices, MPI_Status *array_status)
  Blocks until at least one of the active communications completes; returns the associated indices and statuses of the completed communications; deallocates those objects.
  If the array contains no active requests, returns immediately with outcount=MPI_UNDEFINED.

MPI_Request req[2];
MPI_Status stat[2];
…
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitall(2, req, stat);

MPI_Request req[2];
MPI_Status stat;
int index;
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitany(2, req, &index, &stat);
…

12

MPI_Test Variants

MPI_Testall:
  MPI_Testall(int count, MPI_Request *array_req, int *flag, MPI_Status *array_stat)
  Returns flag=true if all active requests are complete; returns flag=false otherwise.
  If true, de-allocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Testany:
  MPI_Testany(int count, MPI_Request *array_req, int *index, int *flag, MPI_Status *stat)
  If one of the active communications completes, returns flag=true together with the index and status of the completing communication; deallocates that object.
  Returns flag=false, index=MPI_UNDEFINED if none completes.
  Returns flag=true, index=MPI_UNDEFINED if there are no active requests.

MPI_Testsome:
  MPI_Testsome(int incount, MPI_Request *array_req, int *outcount, int *array_indices, MPI_Status *array_stat)
  Returns in outcount the number of completed active communications, together with their indices and statuses.
  If none completes, returns outcount=0; if there are no active requests, returns outcount=MPI_UNDEFINED.

13

Persistent Communication

Structure for nonblocking calls:
  MPI_Ixxxx allocates an MPI_Request.
  MPI_Wait or MPI_Test completes the communication and de-allocates the request object.
Often a communication with the same arguments is executed repeatedly, e.g. every time step or every iteration.
Can create a persistent request that will not be de-allocated by MPI_Wait. This reduces overhead.

Create persistent request: MPI_Send_init, MPI_Recv_init
Repeat:
  Start communication: MPI_Start
  …
  Complete communication: MPI_Wait, MPI_Test
Free persistent request: MPI_Request_free

14

Creation

MPI_Send_init creates a persistent request object for a standard-mode send; MPI_Recv_init does the same for a receive.
The request is bound to the arguments buf, count, datatype, dest (or source), tag, comm. These arguments will not change in the following communications.
On creation the request is inactive: it is not associated with any active communication.
Communication is initiated by MPI_Start.

int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *req)

int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *req)

MPI_Request req_send, req_recv;
double A[100], B[100];
int left_neighbor, right_neighbor, tag=999;
MPI_Status stat_send, stat_recv;
…
MPI_Send_init(A, 100, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_send);
MPI_Recv_init(B, 100, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_recv);
MPI_Start(&req_send);
MPI_Start(&req_recv);
… // do something else useful
MPI_Wait(&req_send, &stat_send);
MPI_Wait(&req_recv, &stat_recv);
MPI_Request_free(&req_send);
MPI_Request_free(&req_recv);

15

Start Communication, Free Request

request is a persistent request created by MPI_Send_init etc.

MPI_Start starts the communication on the request object.
  The call returns immediately. It starts a non-blocking communication.
  Do not access the buffer after this call until completion.
Complete the communication with MPI_Wait, MPI_Test etc.
  MPI_Wait and MPI_Test will not de-allocate the request upon completion of the communication.
De-allocate the persistent request with MPI_Request_free at the end.

int MPI_Start(MPI_Request *request)
MPI_START(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

int MPI_Request_free(MPI_Request *request)
MPI_REQUEST_FREE(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

16

Example: Matrix-Vector Multiplication

AX=Y
A – NxN matrix
X, Y – vectors, dimension N

          A              X        Y
  [ A11 A12 A13 ]     [ X1 ]   [ Y1 ]    cpu 0
  [ A21 A22 A23 ]  *  [ X2 ] = [ Y2 ]    cpu 1
  [ A31 A32 A33 ]     [ X3 ]   [ Y3 ]    cpu 2

Y1 = A11*X1 + A12*X2 + A13*X3
Y2 = A21*X1 + A22*X2 + A23*X3
Y3 = A31*X1 + A32*X2 + A33*X3

[Figure: the same layout after one and two upward shifts of X. After the first shift cpu 0, 1, 2 hold X2, X3, X1; after the second shift they hold X3, X1, X2. A and Y stay in place.]

17

Example: Matrix-Vector

Data on cpu 0:
  [A11 A12 A13] – N/3 x N matrix
  X1 – vector, length N/3
  Y1 – vector, length N/3

Need to communicate: X1, X2, X3
  Upward shift. Number of shifts = ncpus-1

Assume:
  A[i][j] = i+j
  X[i] = i
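The shift scheme can be sanity-checked in plain C before any MPI is involved. The sketch below simulates p processes inside one program, using the assumed data A[i][j] = i+j and X[i] = i, and compares against a direct product. The function names `direct_matvec` and `ring_matvec` are hypothetical, and the simulation only illustrates which block of A each simulated process must use after each upward shift; it is not the parallel program itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reference result: Y[i] = sum_j A[i][j]*X[j] with A[i][j] = i+j, X[j] = j. */
void direct_matvec(int n, double *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += (double)(i + j) * j;
    }
}

/* Serial simulation of the shift algorithm on p simulated processes.
   n must be divisible by p. Returns 0 on success, -1 if allocation fails. */
int ring_matvec(int n, int p, double *y)
{
    assert(p > 0 && n % p == 0);
    int nx = n / p;                            /* rows and X entries per process */
    double *x  = malloc(sizeof(double) * n);   /* x[r*nx ...]: X block currently held by rank r */
    double *xt = malloc(sizeof(double) * n);   /* receive buffers of all ranks */
    if (!x || !xt) { free(x); free(xt); return -1; }

    for (int i = 0; i < n; i++) { x[i] = i; y[i] = 0.0; }

    for (int count = 0; count < p; count++) {
        for (int r = 0; r < p; r++) {
            /* after `count` upward shifts rank r holds X block (r+count)%p */
            int sindex = ((r + count) % p) * nx;
            for (int i = 0; i < nx; i++) {
                int gi = r * nx + i;           /* global row index */
                for (int j = 0; j < nx; j++)
                    y[gi] += (double)(gi + sindex + j) * x[r * nx + j];
            }
        }
        if (count < p - 1) {
            /* upward shift: rank r receives the block of rank (r+1)%p */
            for (int r = 0; r < p; r++)
                memcpy(xt + r * nx, x + ((r + 1) % p) * nx, sizeof(double) * nx);
            memcpy(x, xt, sizeof(double) * n);
        }
    }
    free(x);
    free(xt);
    return 0;
}
```

Calling `ring_matvec(12, 3, y)` and `direct_matvec(12, y_ref)` should give identical vectors; all values are small integers, so the comparison can be exact.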

18

Example (non-blocking comm)

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include "dmath.h" // ignore this for now

#define DIM 1000 // logical A[DIM][DIM], X[DIM], Y[DIM]

int main(int argc, char **argv)
{
    int ncpus, my_rank, left_neighbor, right_neighbor, tag=1001;
    int Nx, Ny; // Ny=DIM, Nx=DIM/ncpus, on each cpu: A[Nx][Ny], X[Nx], Y[Nx]
    MPI_Request req_sr[2];
    MPI_Status stat_sr[2];
    double **A, *X, *Y, *Xt;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    if(DIM%ncpus != 0) { // assume DIM divisible by ncpus
        if(my_rank==0)
            printf("ERROR: grid size cannot be divided by ncpus!\n");
        MPI_Finalize();
        return -1;
    }
    Nx = DIM/ncpus; // again on each cpu: A[Nx][Ny] etc
    Ny = DIM;

    left_neighbor = (my_rank-1 + ncpus)%ncpus; // top neighbor
    right_neighbor = (my_rank+1)%ncpus; // bottom neighbor

    A = DMath::newD(Nx, Ny); // allocate memory, ignore DMath – my own routine
    X = DMath::newD(Nx);
    Xt = DMath::newD(Nx); // Xt – temporary space for receiving from neighbor
    Y = DMath::newD(Nx);

19

Example

    int i,j;
    for(i=0;i<Nx;i++) { // initialize A, X
        for(j=0;j<Ny;j++)
            A[i][j] = (my_rank*Nx+i) + j; // *** important ***
        X[i] = my_rank*Nx+i;
    }
    int count; // loop counter
    int sindex, curr_block;
    memset(Y, '\0', sizeof(double)*Nx); // zero out result vector Y first
    for(count=0;count<ncpus;count++){
        if(count < ncpus-1) {
            // receive from bottom neighbor
            MPI_Irecv(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
            // send to top neighbor
            MPI_Isend(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);
        }

        // compute on current data
        curr_block = (my_rank+count)%ncpus; // *** important ***
        sindex = curr_block*Nx; // starting index of A[i][sindex+0:sindex+Nx-1]
        for(i=0;i<Nx;i++)
            for(j=0;j<Nx;j++)
                Y[i] += A[i][sindex+j]*X[j]; // *** important ***

        // complete comm
        if(count<ncpus-1) {
            MPI_Waitall(2, req_sr, stat_sr); // data now in Xt
            memcpy(X, Xt, sizeof(double)*Nx); // copy data from Xt to X *** important ***
        }
    }

20

Example

    // clean up, free memory
    DMath::del(A); // Ignore DMath for now
    DMath::del(X);
    DMath::del(Xt);
    DMath::del(Y);

    MPI_Finalize();
    return 0;
}

21

Example: Persistent Communication

    ...
    MPI_Recv_init(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
    MPI_Send_init(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);
    for(count=0;count<ncpus;count++){
        if(count < ncpus-1)
            MPI_Startall(2, req_sr);

        // compute on current data
        curr_block = (my_rank+count)%ncpus;
        sindex = curr_block*Nx;
        for(i=0;i<Nx;i++)
            for(j=0;j<Nx;j++)
                Y[i] += A[i][sindex+j]*X[j];

        // complete comm
        if(count<ncpus-1) {
            MPI_Waitall(2, req_sr, stat_sr); // data now in Xt
            memcpy(X, Xt, sizeof(double)*Nx); // copy data to X
        }
    }
    MPI_Request_free(&req_sr[0]);
    MPI_Request_free(&req_sr[1]);
    ...

22

Example: Send-Recv

    ...
    for(count=0;count<ncpus;count++){

        // compute on current data
        curr_block = (my_rank+count)%ncpus;
        sindex = curr_block*Nx;
        for(i=0;i<Nx;i++)
            for(j=0;j<Nx;j++)
                Y[i] += A[i][sindex+j]*X[j];

        // send-recv
        if(count<ncpus-1)
            MPI_Sendrecv_replace(X, Nx, MPI_DOUBLE, left_neighbor, tag,
                                 right_neighbor, tag, MPI_COMM_WORLD, &stat_sr);
    }
    ...

23

HWK#2: Matrix Multiplication

Column-wise decomposition

       A                    B                   C
                     [ B11 B12 B13 ]
  [ A1 A2 A3 ]   *   [ B21 B22 B23 ]   =   [ C1 C2 C3 ]
                     [ B31 B32 B33 ]

C1 = A1*B11 + A2*B21 + A3*B31    cpu 0
C2 = A1*B12 + A2*B22 + A3*B32    cpu 1
C3 = A1*B13 + A2*B23 + A3*B33    cpu 2

A, B, C – NxN matrices
P – number of processors

A1, A2, A3 – Nx(N/P) matrices
C1, C2, C3 - …
Bij – (N/P)x(N/P) matrices

Input:
  A[i][j] = 2*i + j
  B[i][j] = 2*i - j
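The block formula Cj = A1*B1j + A2*B2j + … can be verified serially before parallelizing. The sketch below is only a check of the decomposition, not a parallel solution: `blocked_matmul` is a hypothetical helper, the matrices are assumed to be stored row-major in flat arrays, and n is assumed divisible by p.

```c
#include <assert.h>

/* Hypothetical helper: computes C = A*B for n x n row-major matrices by
   accumulating, for each column block j (the block owned by "cpu" j),
   the products A_k * B_kj over all block indices k. */
void blocked_matmul(int n, int p, const double *a, const double *b, double *c)
{
    assert(p > 0 && n % p == 0);
    int w = n / p;                              /* block width N/P */

    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (int j = 0; j < p; j++)                 /* column block C_j */
        for (int k = 0; k < p; k++)             /* term A_k * B_kj */
            for (int row = 0; row < n; row++)
                for (int col = 0; col < w; col++)
                    for (int m = 0; m < w; m++)
                        c[row * n + j * w + col] +=
                            a[row * n + k * w + m] *          /* entry of A_k  */
                            b[(k * w + m) * n + j * w + col]; /* entry of B_kj */
}
```

In the parallel version each process would own one j and hold one A_k at a time, so the loop over k corresponds to the sequence of shifts, in the same way as for the matrix-vector example.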

24

HWK #2

Implement the above parallel matrix multiplication (column-wise data decomposition) in either C, C++ or Fortran.
  Use non-blocking communication or persistent communication in MPI.
Test your parallel implementation and make sure the result is correct.
  The result for matrix C on p CPUs must be identical to that on 1 CPU.
Use a matrix size of 2048x2048 (double).
Time the “multiplication section” of your code using the MPI_Wtime() routine for wall-clock time.
Run your code on 1, 2, 4, 8, 16 CPUs and obtain the wall-clock times: T1, T2, …, T16.
Compute the parallel speedup factors Sp = T1/Tp, e.g. S8 = T1/T8 for 8 CPUs.
Plot Sp vs. number of CPUs.

Turn in:
  Source code + compiled binary code on either hamlet or radon.
  Table of wall-clock time vs. number of CPUs.
  Plot of parallel speedup factors.
  Write-up of what you have learned from the implementation and timing results.

Due date: Oct. 11
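The speedup table follows directly from the measured times. A minimal sketch, assuming the wall-clock times (for example differences of two MPI_Wtime() readings) have already been collected into an array whose first entry is T1; `speedups` is a hypothetical helper name:

```c
#include <assert.h>

/* Hypothetical helper: given n wall-clock times t[0..n-1], where t[0]
   is the single-CPU time T1, fills s[k] = T1 / t[k]. */
void speedups(const double *t, int n, double *s)
{
    for (int k = 0; k < n; k++) {
        assert(t[k] > 0.0);      /* a measured time must be positive */
        s[k] = t[0] / t[k];
    }
}
```

For instance, made-up times of 8.0, 4.0 and 2.0 seconds on 1, 2 and 4 CPUs would give speedups of 1, 2 and 4; real measurements usually fall below this ideal line.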

25

Collective Communications

26

Overview

All processes in a group participate in the communication, by calling the same function with matching arguments.
Types of collective operations:
  Synchronization: MPI_Barrier
  Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
  Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
Collective routines are blocking:
  Completion of the call means the communication buffer can be accessed.
  It gives no indication of other processes’ completion status.
  It may or may not have the effect of synchronizing the processes.

27

Overview

Can use the same communicators as PtP communications.
  MPI guarantees that messages from collective communications will not be confused with PtP communications.
The key is the group of processes taking part in the communication.
  If you want only a sub-group of processes involved in a collective communication, you need to create a sub-group/sub-communicator from MPI_COMM_WORLD.

28

Barrier

Blocks the calling process until all group members have called it.

Decreases performance, because every process has to wait for the slowest one. Refrain from using it explicitly unless synchronization is really needed.

int MPI_Barrier(MPI_Comm comm)
MPI_BARRIER(COMM,IERROR)
    INTEGER COMM, IERROR

…
MPI_Barrier(MPI_COMM_WORLD); // synchronization point
…

29

Broadcast

Broadcasts a message from the process with rank root to all processes in the group, including itself.

comm and root must be the same in all processes.
The amount of data sent must equal the amount of data received, pairwise between each process and the root.
  For now, this means count and datatype must be the same for all processes; they may differ when generalized datatypes are involved.

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
