
Plan:

I. Introduction: Programming Model

II. Basic MPI Command

III. Examples

IV. Collective Communications

V. More on Communication modes

VI. References on MPI

Basic examples with MPI

M. Garbey

Reference: http://www.mcs.anl.gov/mpi/

I Introduction

Definition

Model for a sequential program

The program is written in a classical language (Fortran, C, C++, ...).

The program is executed by one and only one processor.

All variables and constants of the program are allocated in central memory.

[Figure: one processing element (PE) running the program, attached to its memory]

Message Passing Programming Model

The computer is an ensemble of processors with an arbitrary interconnection topology.

Each processor has its own medium-size local memory.

Each processor executes its own program.

Processors communicate by message passing.

Any processor can send a message to any other processor.

There are no shared resources (CPU, memory, ...).

[Figure: processes 0 to 3, each with its own memory and its own program, connected by a network]

[Figure: processes 0 to 3, each with its own memory, all running a Single Program on Multiple Data, connected by a network]

Execution model: SPMD

Single Program Multiple Data

The same program is executed by all the processors.

Most computers can run this model.

It is a particular case of MPMD, but SPMD can emulate MPMD:

If processor is in set A then do piece of code A

If processor is in set B then do piece of code B

...


Process = Basic Unit of Computation

A program written in a “standard” sequential language with library calls to implement message passing.

A process executes on a node - other processes may execute simultaneously on other nodes

A process communicates and synchronizes with other processes via messages.

A process is uniquely identified by its label

A process does not migrate …..


Processes communicate and synchronize with each other by sending and receiving messages. (No global variables or shared memory)

Processes execute independently and asynchronously (no global synchronizing clock)

Processes may be unique and work on their own data set

Any process may communicate with any other process (A priori no limitation on message passing)

Common Communication Patterns

• One processor to one processor

• One processor to many processors
    - Input data

• Many processors to one processor
    - Printing results
    - Global operations

• Many processors to many processors
    - Algorithm step (FFT, ...)

II. 6 basic functions of MPI

MPI_Init        initializes MPI
MPI_Comm_Size   gives the number of processes
MPI_Comm_Rank   gives the rank of the process
MPI_Send        sends a message
MPI_Recv        receives a message
MPI_Finalize    ends the MPI environment

      integer code

c -- start MPI
      call MPI_INIT(code)

c -- end MPI
      call MPI_FINALIZE(code)

MPI_Comm_Size gives the number of processes
MPI_Comm_Rank gives the rank of the process

      integer nb_procs, rank, code

c -- gives the number of processes running in the code:
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)

c -- gives the rank of the process running this function:
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

NOTE: 0 <= rank <= nb_procs - 1

NOTE: MPI_COMM_WORLD is the set of all processes running in the code


      program who_i_am
      implicit none
      include 'mpif.h'
      integer nb_procs, rank, code

      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      print *, 'I am the process ', rank, ' among ', nb_procs

      call MPI_FINALIZE(code)
      end program who_i_am

> mpirun -np 4 who_i_am

I am the process 3 among 4
I am the process 0 among 4
I am the process 2 among 4
I am the process 1 among 4
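Once each process knows its rank, the SPMD branching described in the introduction (set A runs one piece of code, set B another) can be sketched as follows. The choice of sets is an assumption made purely for illustration: set A is process 0 and set B is every other process, and the print statements stand in for the two pieces of code.

```fortran
program spmd_branch
  implicit none
  include 'mpif.h'
  integer :: rank, nb_procs, code

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  ! Every process runs this same program; the rank selects the branch.
  if (rank == 0) then
     ! hypothetical "set A": process 0 only
     print *, 'process', rank, 'of', nb_procs, ': running piece of code A'
  else
     ! hypothetical "set B": all remaining processes
     print *, 'process', rank, 'of', nb_procs, ': running piece of code B'
  end if

  call MPI_FINALIZE(code)
end program spmd_branch
```

Run with, for example, mpirun -np 4 spmd_branch: one process should report code A and the other three code B, in an arbitrary order.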


MPI_Send sends a message
MPI_Recv receives a message

[Figure: process 1 sends the value 1000 to process 5]

      program node_to_node
      implicit none
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      integer code, rank, value, tag
      parameter (tag=100)

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      if (rank .eq. 1) then
         value = 1000
         call MPI_SEND(value, 1, MPI_INTEGER, 5, tag,
     &                 MPI_COMM_WORLD, code)
      else if (rank .eq. 5) then
         call MPI_RECV(value, 1, MPI_INTEGER, 1, tag,
     &                 MPI_COMM_WORLD, status, code)
      end if

      call MPI_FINALIZE(code)
      end program node_to_node



value is the integer, of type MPI_INTEGER, that is sent; the count 1 is the number of elements in the message

each message should have a tag

This communication protocol uses a blocking send and a blocking receive.

MPI_SEND(value, 1, MPI_INTEGER, 5, tag, MPI_COMM_WORLD, code) blocks the execution of the code until the send is completed: value can then be reused, but there is no guarantee that the message has been received.

MPI_RECV(value, 1, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, status, code) blocks the execution of the code until the receive is completed.

NOTE: at the beginning, use print statements to check that things are OK!



For a communication to succeed:

• the sender must specify a valid destination rank

• the receiver must specify a valid source rank
    • may use wildcard: MPI_ANY_SOURCE

• the communicator must be the same

• tags must match
    • may use wildcard: MPI_ANY_TAG

• message types must match

• the receiver's buffer must be large enough
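The wildcard rules above can be illustrated with a short sketch. The program name any_source and the choice of using each sender's rank as its tag are assumptions made for the example. Process 0 accepts one message from each other process in whatever order they arrive, then reads the actual source and tag from the status array.

```fortran
program any_source
  implicit none
  include 'mpif.h'
  integer :: status(MPI_STATUS_SIZE)
  integer :: code, rank, nb_procs, value, i

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  if (rank == 0) then
     ! One receive per sender; wildcards match any source and any tag.
     do i = 1, nb_procs - 1
        call MPI_RECV(value, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                      MPI_COMM_WORLD, status, code)
        ! The real sender and tag are recovered from the status array.
        print *, 'received', value, 'from process', status(MPI_SOURCE), &
                 'with tag', status(MPI_TAG)
     end do
  else
     value = 1000 + rank
     ! Tag chosen equal to the rank, for illustration only.
     call MPI_SEND(value, 1, MPI_INTEGER, 0, rank, MPI_COMM_WORLD, code)
  end if

  call MPI_FINALIZE(code)
end program any_source
```

The datatype, count and communicator still have to agree on both sides; only the source and the tag may be left open by the receiver.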


MPI Basic Datatypes in Fortran

MPI Datatypes            Fortran Datatypes

MPI_INTEGER              INTEGER
MPI_REAL                 REAL
MPI_DOUBLE_PRECISION     DOUBLE PRECISION
MPI_COMPLEX              COMPLEX
MPI_LOGICAL              LOGICAL
MPI_CHARACTER            CHARACTER(1)


III. The matrix multiply example:

Preliminary: TIMER

• in Fortran: double precision MPI_Wtime()
• Time is measured in seconds.
• The time to perform a task is measured by consulting the timer before and after.
• Modify your program to measure its execution time and print it out.

Example:

      tstart = MPI_WTIME()
      ... (code to be timed) ...
      tend = MPI_WTIME()

      print *, ' node ', myid, ', time = ', tend-tstart, ' seconds'
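A complete, runnable version of this timing pattern might look like the sketch below. The program name timing and the summation loop are placeholders invented here to stand for the code being measured; only MPI_WTIME itself comes from MPI.

```fortran
program timing
  implicit none
  include 'mpif.h'
  integer :: code, myid, i
  double precision :: tstart, tend, s

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, code)

  tstart = MPI_WTIME()          ! consult the timer before the task

  ! Placeholder workload: a partial harmonic sum.
  s = 0.0d0
  do i = 1, 10000000
     s = s + 1.0d0 / dble(i)
  end do

  tend = MPI_WTIME()            ! consult the timer after the task

  ! Printing s keeps the compiler from removing the loop.
  print *, 'node', myid, ', sum =', s, ', time =', tend - tstart, 'seconds'

  call MPI_FINALIZE(code)
end program timing
```

Each process measures its own elapsed time, so the reported values usually differ slightly from one node to the next.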

Simple matrix multiply algorithm

• Matrix A is copied to every processor.

• Matrix B is divided into blocks of columns B_j, j=1..np, and distributed to the processors.

• Each processor j multiplies A by its block B_j; all the products are computed simultaneously.

• Output solutions.

[Figure: A * B = C, with B and C divided into column blocks 1, 2, 3, 4]

• Master: distribute the work to workers, collect results, and output solution.

• Master sends a copy of A to every worker

      do dest=1, numworkers
         call MPI_SEND(a, nra*nca, mpi_double_precision, dest,
     &                 mtype, mpi_comm_world, ierr)
      end do

• Worker: receive a copy of A from master

      call MPI_RECV(a, nra*nca, mpi_double_precision, master,
     &              mtype, mpi_comm_world, status, ierr)


• Master: distribute block of columns of B to workers

• Master sends column length (cols) and column identifier (offset)

      do dest=1, numworkers
         call MPI_SEND(offset, 1, mpi_integer, dest, mtype,
     &                 mpi_comm_world, ierr)
         call MPI_SEND(cols, 1, mpi_integer, dest, mtype,
     &                 mpi_comm_world, ierr)
      end do

• Master sends corresponding values to workers:

      do dest=1, numworkers
         call MPI_SEND(b(1,offset), cols*nca, mpi_double_precision,
     &                 dest, mtype, mpi_comm_world, ierr)
      end do
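The slides do not show how offset and cols are chosen for each worker. One common choice, sketched here without MPI, gives every worker the same number of columns and hands the remainder out one column at a time to the first workers. The names ncb (number of columns of B), avecol and extra, and the sizes used, are assumptions made for this illustration.

```fortran
program column_blocks
  implicit none
  ! Hypothetical sizes: B has ncb columns, shared among numworkers workers.
  integer, parameter :: ncb = 10, numworkers = 4
  integer :: avecol, extra, offset, cols, dest

  avecol = ncb / numworkers        ! columns every worker gets at least
  extra  = mod(ncb, numworkers)    ! leftover columns to spread out
  offset = 1                       ! first column of the current block

  do dest = 1, numworkers
     if (dest <= extra) then
        cols = avecol + 1
     else
        cols = avecol
     end if
     print *, 'worker', dest, ': offset =', offset, ', cols =', cols
     offset = offset + cols        ! next block starts right after this one
  end do
end program column_blocks
```

With 10 columns and 4 workers this yields blocks of 3, 3, 2 and 2 columns starting at columns 1, 4, 7 and 9. In the master loop, offset and cols would be updated in the same way before each pair of sends.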


• Workers receive the data:

      call MPI_RECV(offset, 1, mpi_integer, master, mtype,
     &              mpi_comm_world, status, ierr)
      call MPI_RECV(cols, 1, mpi_integer, master, mtype,
     &              mpi_comm_world, status, ierr)
      call MPI_RECV(b, cols*nca, mpi_double_precision, master,
     &              mtype, mpi_comm_world, status, ierr)

• Workers do matrix multiply:

      do k=1, cols
         do i=1, nra
            c(i,k) = 0.0d0
            do j=1, nca
               c(i,k) = c(i,k) + a(i,j) * b(j,k)
            end do
         end do
      end do


• Workers send the results for their block back to the master:

      call MPI_SEND(c, cols*nra, mpi_double_precision, master,
     &              mtype, mpi_comm_world, ierr)

• Master receives results from workers:

      do i=1, numworkers
c        offset and cols must be those of the block computed by worker i
         call MPI_RECV(c(1,offset), cols*nra, mpi_double_precision,
     &                 i, mtype, mpi_comm_world, status, ierr)
      end do

Remark: Fortran is not case sensitive


IV. Collective Communications:

• Substitute for a more complex sequence of calls

• Involve all the processes in a process group

• Called by all processes in a communicator

• All routines block until they are locally complete

• Receive buffers must be exactly the right size

• No message tags are needed

• Collective calls are divided into three subsets:
    • synchronization
    • data movement
    • global computation

Barrier Synchronization Routines

• To synchronize all processes within a communicator

• A communicator is a group of processes and a context of communication

• The base group is the group that contains all processes, which is associated with the MPI_COMM_WORLD communicator.

• A node calling it will be blocked until all nodes within the group have called it.

      call MPI_BARRIER(comm, ierr)

Broadcast

• One processor sends some data to all processors in a group

      call MPI_BCAST(buffer, count, datatype, root, comm, ierr)

• MPI_BCAST must be called by each node in the group, specifying the same communicator and root. The message is sent from the root process to all processes in the group, including the root process.

Scatter

• Data are distributed into n equal segments, where the ith segment is sent to the ith process in the group, which has n processes.

      call MPI_SCATTER(sbuf, scount, sdatatype, rbuf, rcount,
     &                 rdatatype, root, comm, ierr)
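A small sketch combining the two calls is given below. The program name bcast_scatter, the value 42, the segment length chunk = 2 and the array names are all assumptions made for illustration. The root broadcasts one integer to everyone, then scatters an array so that each process receives its own segment of two reals.

```fortran
program bcast_scatter
  implicit none
  include 'mpif.h'
  integer, parameter :: root = 0, chunk = 2
  integer :: code, rank, nb_procs, n, i
  real :: piece(chunk)
  real, allocatable :: full(:)

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  ! Broadcast: only the root knows n beforehand.
  n = 0
  if (rank == root) n = 42
  call MPI_BCAST(n, 1, MPI_INTEGER, root, MPI_COMM_WORLD, code)

  ! Scatter: the send buffer only matters on the root.
  if (rank == root) then
     allocate(full(chunk * nb_procs))
     do i = 1, chunk * nb_procs
        full(i) = real(i)
     end do
  else
     allocate(full(1))            ! dummy buffer, ignored by MPI_SCATTER
  end if
  call MPI_SCATTER(full, chunk, MPI_REAL, piece, chunk, MPI_REAL, &
                   root, MPI_COMM_WORLD, code)

  print *, 'process', rank, ': n =', n, ', piece =', piece

  deallocate(full)
  call MPI_FINALIZE(code)
end program bcast_scatter
```

Note that scount and rcount are both the size of one segment, not the size of the whole array held by the root.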

Gather

• Data are collected into a specified process in the order of process rank.

• Gather is the reverse process of scatter.

      call MPI_GATHER(sbuf, scount, sdatatype, rbuf, rcount,
     &                rdatatype, root, comm, ierr)

Example: the data in Proc. 0 are {1,2}, in Proc. 1: {3,4}, in Proc. 2: {5,6}, ..., in Proc. 5: {11,12}. Then

      real sbuf(2), rbuf(12)
      call MPI_GATHER(sbuf, 2, MPI_REAL, rbuf, 2, MPI_REAL, 3,
     &                MPI_COMM_WORLD, ierr)

will bring {1,2,3,4,5,6,...,11,12} into rbuf on Proc. 3.

Similarly, the inverse transfer, which sends the 12 values held in rbuf on Proc. 3 back as pairs into sbuf on each process, is:

      call MPI_SCATTER(rbuf, 2, MPI_REAL, sbuf, 2, MPI_REAL, 3,
     &                 MPI_COMM_WORLD, ierr)


Two more MPI functions: MPI_Allgather and MPI_Alltoall:

      MPI_Allgather(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
      MPI_Alltoall(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)

sbuf: starting address of send buffer
scount: number of elements sent to each process
stype: data type of send buffer elements
rbuf: address of receive buffer
rcount: number of elements received from any process
rtype: data type of receive buffer elements
comm: communicator

To summarize:

[Figure: data movement between two processes p0 and p1 for Broadcast, Scatter, Gather, All Gather, and All to All]

Global Reduction Routines

• The partial result in each process in the group is combined together using some desired function.

• The operation function passed to a global computation routine is either a predefined MPI function or a user-supplied function.

• Examples:
    • Global sum or product.
    • Global maximum or minimum.
    • Global user-defined operation.

      MPI_Reduce(sbuf, rbuf, count, stype, op, root, comm, ierr)
      MPI_Allreduce(sbuf, rbuf, count, stype, op, comm, ierr)

sbuf: address of send buffer
rbuf: address of receive buffer
count: the number of elements in the send buffer
stype: the data type of elements of send buffer
op: the reduce operation function, predefined or user-defined
root: the rank of the root process
comm: communicator

MPI_Reduce returns the result to a single process (the root).

MPI_Allreduce returns the result to all processes in the group.
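The difference between the two routines can be seen in a minimal sketch, assuming each process contributes the single integer rank + 1 (the program and variable names are illustrative). The sum is delivered to process 0 only, while the maximum is delivered to every process.

```fortran
program global_ops
  implicit none
  include 'mpif.h'
  integer :: code, rank, nb_procs, local, total, biggest

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  local = rank + 1                 ! this process's partial result

  ! Sum of all contributions, available on the root (process 0) only.
  call MPI_REDUCE(local, total, 1, MPI_INTEGER, MPI_SUM, 0, &
                  MPI_COMM_WORLD, code)

  ! Maximum of all contributions, available on every process.
  call MPI_ALLREDUCE(local, biggest, 1, MPI_INTEGER, MPI_MAX, &
                     MPI_COMM_WORLD, code)

  if (rank == 0) print *, 'global sum on root =', total
  print *, 'process', rank, ': global max =', biggest

  call MPI_FINALIZE(code)
end program global_ops
```

On p processes the sum is p(p+1)/2 and the maximum is p. The variable total is only meaningful on the root, which is why only process 0 prints it.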




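[Figure: MPI_Reduce(sendbuf, recvbuf, 4, MPI_INT, MPI_MAX, 0, comm) on processes p0, p1, p2: the element-wise maximum of the three send buffers arrives in the receive buffer of p0]

[Figure: MPI_Allreduce(sendbuf, recvbuf, 4, MPI_INT, MPI_SUM, comm) on processes p0, p1, p2: the element-wise sum of the three send buffers arrives in the receive buffer of every process]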

Global Reduction Routines: Examples

c A subroutine that computes the dot product of two vectors that are
c distributed across a group of processes and returns the answer at
c node zero:

      subroutine PAR_BLAS1(N, a, b, scalar_product, comm)
      implicit none
      include 'mpif.h'
      integer N, comm, ierr, I
      real a(N), b(N), sum, scalar_product

      sum = 0.0
      do I = 1, N
         sum = sum + a(I) * b(I)
      end do

      call MPI_REDUCE(sum, scalar_product, 1, MPI_REAL, MPI_SUM, 0,
     &                comm, ierr)

      return
      end


Global Reduction Routines: Predefined Reduce Operations

MPI NAME    FUNCTION      MPI NAME    FUNCTION

MPI_MAX     Maximum       MPI_LOR     Logical OR
MPI_MIN     Minimum       MPI_LAND    Logical AND
MPI_SUM     Sum           MPI_PROD    Product


V. More on Communication Mode:

So far, we have seen the standard SEND and RECEIVE functions. However, we need to know more in order to overlap communications with computations, and more generally to optimize the code.

Blocking Calls

• A blocking send or receive call suspends execution of the user's program until the message buffer being sent/received is safe to use.

• In the case of a blocking send, this means the data to be sent have been copied out of the send buffer, but they have not necessarily been received by the receiving task. The contents of the send buffer can then be modified without affecting the message that was sent.

• The blocking receive implies that the data in the receive buffer are valid.

Blocking Communication Modes:

• Synchronous Send: MPI_SSEND: Returns when the message buffer can be safely reused. The sending task tells the receiver that a message is ready for it and waits for the receiver to acknowledge.
    • System overhead: buffer to network and vice versa.
    • Synchronization overhead: handshake + waiting.
    • Safe and portable.

• Buffered Send: MPI_BSEND: Returns when the message is copied to the system buffer.

• Standard Send: MPI_SEND: Either synchronous or buffered, implemented by the vendor to give good performance for most programs.

• In MPICH: we do have buffered send


Non-Blocking Calls

• Non-blocking calls return immediately after initiating the communication.

• In order to reuse the send message buffer, the programmer must check for its status.

• The programmer can choose to block before the message buffer is used or test for the status of the message buffer.

• A blocking or non-blocking send can be paired with a blocking or non-blocking receive.

• Syntax:
      call MPI_ISEND(buf, count, datatype, dest, tag, comm, handle, ierr)
      call MPI_IRECV(buf, count, datatype, src, tag, comm, handle, ierr)


• The programmer can block or check for the status of the message buffer:

• MPI_Wait(request, status, ierr)
    • This routine blocks until the communication has completed. It is useful when the data from the communication buffer is about to be re-used.

• MPI_Test(request, flag, status, ierr)
    • This routine does not block: it queries completion of the communication specified by the handle request, and the result (True or False) is returned in flag. The request handle will have been returned by an earlier call to a non-blocking communication routine.
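The polling pattern with MPI_Test can be sketched as follows, assuming exactly two processes that exchange one array each. The program name overlap, the message length and the counter work (which stands in for useful computation done while waiting) are assumptions made for this example.

```fortran
program overlap
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000, tag = 100
  integer :: code, rank, nb_procs, request, other, work
  integer :: status(MPI_STATUS_SIZE)
  logical :: flag
  real :: sendbuf(n), recvbuf(n)

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  if (nb_procs /= 2) then
     if (rank == 0) print *, 'this sketch expects exactly 2 processes'
     call MPI_FINALIZE(code)
     stop
  end if

  other = 1 - rank
  sendbuf = real(rank)

  ! Post the receive first: it returns immediately with a request handle.
  call MPI_IRECV(recvbuf, n, MPI_REAL, other, tag, MPI_COMM_WORLD, &
                 request, code)
  call MPI_SEND(sendbuf, n, MPI_REAL, other, tag, MPI_COMM_WORLD, code)

  ! Keep working until MPI_TEST reports that the receive has completed.
  work = 0
  flag = .false.
  do while (.not. flag)
     work = work + 1                       ! placeholder computation
     call MPI_TEST(request, flag, status, code)
  end do

  ! recvbuf may only be read once flag is true.
  print *, 'process', rank, ': received', recvbuf(1), 'after', work, 'test(s)'

  call MPI_FINALIZE(code)
end program overlap
```

Replacing the loop with a single call to MPI_WAIT(request, status, code) gives the blocking variant, which is simpler when there is no computation to overlap.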



Deadlock

- All tasks are waiting for events that haven't been initiated
- Common in SPMD programs with blocking communication, e.g. every task sends, but none receives
- Insufficient system buffer space is available
- Remedies:
    - Arrange for one task to receive
    - Use MPI_Sendrecv
    - Use non-blocking communication

Examples: Deadlock

c Improper use of blocking calls results in deadlock, run on two nodes
c author : Roslyn Leibensperger, (CTC)
      program deadlock
      implicit none
      include 'mpif.h'

      integer MSGLEN, ITAG_A, ITAG_B
      parameter (MSGLEN = 2048, ITAG_A = 100, ITAG_B = 200)
      real rmsg1(MSGLEN), rmsg2(MSGLEN)
      integer irank, idest, isrc, istag, iretag
      integer istatus(MPI_STATUS_SIZE), ierr, I

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, irank, ierr)
      do I = 1, MSGLEN
         rmsg1(I) = 100
         rmsg2(I) = -100
      end do

      if (irank .eq. 0) then
         idest = 1
         isrc = 1
         istag = ITAG_A
         iretag = ITAG_B
      else if (irank .eq. 1) then
         idest = 0
         isrc = 0
         istag = ITAG_B
         iretag = ITAG_A
      end if

      print *, ' Task ', irank, ' is sending the message '
      call MPI_SSEND(rmsg1, MSGLEN, MPI_REAL, idest, istag,
     &               MPI_COMM_WORLD, ierr)
      call MPI_RECV(rmsg2, MSGLEN, MPI_REAL, isrc, iretag,
     &              MPI_COMM_WORLD, istatus, ierr)
      print *, ' Task ', irank, ' has received the message '
      call MPI_FINALIZE(ierr)
      end


Examples: Deadlock (fixed)

c Solution program showing the use of a non-blocking send to
c eliminate deadlock
c author : Roslyn Leibensperger (CTC)
      program fixed
      implicit none
      include 'mpif.h'
c ----------------------------------------------
      print *, ' Task ', irank, ' has started the send '
      call MPI_ISEND(rmsg1, MSGLEN, MPI_REAL, idest, istag,
     &               MPI_COMM_WORLD, irequest, ierr)
      call MPI_RECV(rmsg2, MSGLEN, MPI_REAL, isrc, iretag,
     &              MPI_COMM_WORLD, irstatus, ierr)
      call MPI_WAIT(irequest, isstatus, ierr)
      print *, ' Task ', irank, ' has completed the send '
      call MPI_FINALIZE(ierr)
      end



Sendrecv

- Useful for executing a shift operation across a chain of processes.
- The system takes care of possible deadlock due to blocking calls.

      MPI_Sendrecv(sbuf, scount, stype, dest, stag,
                   rbuf, rcount, rtype, source, rtag, comm, status, ierr)

- sbuf (rbuf): initial address of send (receive) buffer.
- scount (rcount): number of elements in send (receive) buffer.
- stype (rtype): type of elements in send (receive) buffer.
- stag (rtag): send (receive) tag.
- dest: rank of destination.
- source: rank of source.
- comm: communicator.
- status: status object.

program sendrecv
  implicit none
  include 'mpif.h'
  integer, dimension(MPI_STATUS_SIZE) :: status
  ! any valid tag value works; 100 is used as in node_to_node
  integer, parameter :: tag = 100
  integer :: rank, value, num_proc, code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  ! we suppose that we have only two processes
  num_proc = mod(rank+1, 2)

  call MPI_SENDRECV(rank+1000, 1, MPI_INTEGER, num_proc, tag, &
                    value, 1, MPI_INTEGER, num_proc, tag, &
                    MPI_COMM_WORLD, status, code)

  print *, 'me, process', rank, ', I have received', value, 'from process', num_proc
  call MPI_FINALIZE(code)
end program sendrecv

> mpirun -np 2 sendrecv

me, process 1 , I have received 1000 from process 0
me, process 0 , I have received 1001 from process 1

Remark: if each process instead called a blocking send (MPI_SEND acting in synchronous mode) before its receive, the code could deadlock, because each process would wait for a matching receive that is never posted.
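The shift operation across a chain of processes mentioned for MPI_Sendrecv can be sketched for any number of processes as a ring. The program name ring_shift and the variables left, right, sendval and recvval are illustrative choices. Each process sends its rank to its right neighbour and receives from its left neighbour in a single call.

```fortran
program ring_shift
  implicit none
  include 'mpif.h'
  integer, parameter :: tag = 100
  integer :: status(MPI_STATUS_SIZE)
  integer :: code, rank, nb_procs, left, right, sendval, recvval

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  ! Neighbours on a ring: the last process wraps around to process 0.
  right = mod(rank + 1, nb_procs)
  left  = mod(rank - 1 + nb_procs, nb_procs)

  sendval = rank

  ! Send to the right and receive from the left in one call.
  call MPI_SENDRECV(sendval, 1, MPI_INTEGER, right, tag, &
                    recvval, 1, MPI_INTEGER, left, tag, &
                    MPI_COMM_WORLD, status, code)

  print *, 'process', rank, 'received', recvval, 'from process', left

  call MPI_FINALIZE(code)
end program ring_shift
```

Because the send and the receive are handled together, no ordering of sends and receives has to be arranged by hand to keep the ring from deadlocking.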

Optimizations

Optimization must be a main concern when communication time becomes a significant part of the total compared to computation time.

Optimization of communications may be accomplished at different levels; the main ones are:

1. Overlap communication with computation.
2. Avoid, if possible, copying the message into temporary memory (buffering).
3. Minimize the additional costs induced by calling communication subroutines too often.

