Plan:
I. Introduction: Programming Model
II. Basic MPI Command
III. Examples
IV. Collective Communications
V. More on Communication modes
VI. References on MPI
Basic examples with MPI
M. Garbey
Reference: http://www.mcs.anl.gov/mpi/
I Introduction
Definition: model for a sequential program
The program is written in a classical language (Fortran, C, C++, ...).
The program is executed by one and only one processor.
All variables and constants of the program are allocated in central memory.
[Figure: a single processing element (PE) running the program, attached to one memory]
Message Passing Programming Model
The computer is an ensemble of processors with an arbitrary interconnection topology.
Each processor has its own medium-size local memory.
Each processor executes its own program.
Processors communicate by message passing.
Any processor can send a message to any other processor.
There are no shared resources (CPU, memory, ...).
[Figure: four processes, numbered 0 to 3, each with its own memory and its own program, connected by a network]
[Figure: the same four processes and network in the Single Program Multiple Data case, where every process runs the same program]
Execution model: SPMD (Single Program Multiple Data)
The same program is executed by all the processors.
Most computers can run this model.
SPMD is a particular case of MPMD, but SPMD can emulate MPMD:
If processor is in set A then do piece of code A
If processor is in set B then do piece of code B
…
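The branching described above can be sketched in Fortran as follows. This is only an illustration, not code from the original slides: the program name spmd_sets and the choice "set A is rank 0, set B is every other rank" are assumptions made for the example. The MPI calls it uses are the ones introduced in section II.

```fortran
program spmd_sets
  implicit none
  include 'mpif.h'
  integer :: rank, nb_procs, code

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  ! Every process runs this same program; the rank selects the branch.
  if (rank == 0) then
     ! Set A (assumed here to contain rank 0 only)
     print *, 'rank', rank, 'of', nb_procs, ': running piece of code A'
  else
     ! Set B (assumed here to contain all the other ranks)
     print *, 'rank', rank, 'of', nb_procs, ': running piece of code B'
  end if

  call MPI_FINALIZE(code)
end program spmd_sets
```

Launched with several processes, a single executable then behaves like two different programs, which is how SPMD emulates MPMD.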
Process = Basic Unit of Computation
A program written in a “standard” sequential language with library calls to implement message passing.
A process executes on a node - other processes may execute simultaneously on other nodes
A process communicates and synchronizes with other processes via messages.
A process is uniquely identified by its label
A process does not migrate …..
Processes communicate and synchronize with each other by sending and receiving messages. (No global variables or shared memory)
Processes execute independently and asynchronously (no global synchronizing clock)
Processes may be unique and work on their own data set
Any process may communicate with any other process (A priori no limitation on message passing)
Common Communication Patterns
• One processor to one processor
• One processor to many processors (e.g. input data)
• Many processors to one processor (e.g. printing results, global operations)
• Many processors to many processors (e.g. an algorithm step such as an FFT)
II. 6 basic functions of MPI
MPI_Init        Initialize MPI
MPI_Comm_size   Gives the number of processes
MPI_Comm_rank   Gives the rank of the process
MPI_Send        Send a message
MPI_Recv        Receive a message
MPI_Finalize    End the MPI environment

      integer code
c -- start MPI
      call MPI_INIT(code)
c -- end MPI
      call MPI_FINALIZE(code)
integer nb_procs, rank, code
c -- gives the number of processes running in the code:
call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
c -- gives the rank of the process running this function:
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
NOTE: 0 <= rank <= nb_procs - 1
NOTE: MPI_COMM_WORLD is the communicator for the set of all processes running in the code
      program who_i_am
      implicit none
      include 'mpif.h'
      integer nb_procs, rank, code
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      print *, 'I am the process ', rank, ' among ', nb_procs
      call MPI_FINALIZE(code)
      end program who_i_am

> mpirun -np 4 who_i_am
I am the process 3 among 4
I am the process 0 among 4
I am the process 2 among 4
I am the process 1 among 4
MPI_Send: send a message. MPI_Recv: receive a message.
[Figure: process 1 sends the value 1000 to process 5]
      program node_to_node
      implicit none
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      integer code, rank, value, tag
      parameter (tag=100)
      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      if (rank .eq. 1) then
         value = 1000
         call MPI_SEND(value, 1, MPI_INTEGER, 5, tag,
     &                 MPI_COMM_WORLD, code)
      elseif (rank .eq. 5) then
         call MPI_RECV(value, 1, MPI_INTEGER, 1, tag,
     &                 MPI_COMM_WORLD, status, code)
      end if
      call MPI_FINALIZE(code)
      end program node_to_node
• value is the number, of type MPI_INTEGER, that is sent.
• Each message should have a tag.
• This communication protocol uses a blocking send and a blocking receive.
• MPI_SEND(value, 1, MPI_INTEGER, 5, tag, MPI_COMM_WORLD, code) blocks the execution of the code until the send is completed. value can then be reused, but there is no guarantee that the message has been received.
• MPI_RECV(value, 1, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, status, code) blocks the execution of the code until the receive is completed.
NOTE: at the beginning, use print statements to check that things are OK!
For a communication to succeed:
• the sender must specify a valid destination rank
• the receiver must specify a valid source rank
    • may use the wildcard MPI_ANY_SOURCE
• the communicator must be the same
• tags must match
    • may use the wildcard MPI_ANY_TAG
• message types must match
• the receiver's buffer must be large enough
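The two wildcards can be illustrated with a short sketch in which rank 0 accepts one message from every other rank, in whatever order they arrive, and then reads the actual source and tag from the status array. The program name any_source, the payload 10*rank and the choice of using the sender's rank as tag are assumptions made for this example.

```fortran
program any_source
  implicit none
  include 'mpif.h'
  integer :: status(MPI_STATUS_SIZE)
  integer :: rank, nb_procs, code, value, i

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  if (rank == 0) then
     ! One receive per sender; source and tag are left open.
     do i = 1, nb_procs - 1
        call MPI_RECV(value, 1, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                      MPI_COMM_WORLD, status, code)
        ! The status array tells us who actually sent the message.
        print *, 'received', value, 'from rank', status(MPI_SOURCE), &
                 'with tag', status(MPI_TAG)
     end do
  else
     value = 10 * rank
     ! The tag is set to the sender's rank (an arbitrary choice).
     call MPI_SEND(value, 1, MPI_INTEGER, 0, rank, MPI_COMM_WORLD, code)
  end if

  call MPI_FINALIZE(code)
end program any_source
```

The datatype, count and communicator still have to agree on both sides; only the source and the tag are relaxed.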
MPI Basic Datatypes in Fortran
MPI Datatype            Fortran Datatype
MPI_INTEGER             INTEGER
MPI_REAL                REAL
MPI_DOUBLE_PRECISION    DOUBLE PRECISION
MPI_COMPLEX             COMPLEX
MPI_LOGICAL             LOGICAL
MPI_CHARACTER           CHARACTER(1)
III. The matrix multiply example:
Preliminary: TIMER
• In Fortran: double precision MPI_Wtime()
• Time is measured in seconds.
• The time to perform a task is measured by consulting the timer before and after.
• Modify your program to measure its execution time and print it out.
Example:
      tstart = MPI_WTIME()
c     ... code to be timed ...
      tend = MPI_WTIME()
      print *, 'node ', myid, ', time = ', tend-tstart, ' seconds'
Simple matrix multiply algorithm
• Matrix A is copied to every processor j = 1..np.
• Matrix B is divided into blocks of columns B_j, which are distributed to the processors j = 1..np.
• Every processor j performs the matrix multiply between A and its block B_j simultaneously.
• Output the solution.
[Figure: C = A * B with np = 4; A is held whole by processors 1, 2, 3, 4, while B and C are each split into four column blocks numbered 1 to 4]
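How the columns of B might be split can be sketched in plain Fortran, without any MPI call. The names offset, cols and numworkers follow the slides that come next; ncb (number of columns of B), avecol, extra, the values 10 and 4, and the rule "the first workers take one extra column when the division is not exact" are assumptions made for this sketch.

```fortran
program column_blocks
  implicit none
  integer, parameter :: ncb = 10         ! number of columns of B (assumed)
  integer, parameter :: numworkers = 4   ! number of workers (assumed)
  integer :: dest, offset, cols, avecol, extra

  avecol = ncb / numworkers        ! minimum number of columns per worker
  extra  = mod(ncb, numworkers)    ! leftover columns
  offset = 1                       ! first column of the current block

  do dest = 1, numworkers
     cols = avecol
     ! The first 'extra' workers each take one leftover column.
     if (dest <= extra) cols = cols + 1
     print *, 'worker', dest, ': columns', offset, 'to', offset + cols - 1
     offset = offset + cols
  end do
end program column_blocks
```

With these values the four blocks are columns 1 to 3, 4 to 6, 7 to 8 and 9 to 10, so every column of B is assigned exactly once.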
• Master: distributes the work to the workers, collects the results, and outputs the solution.
• Master sends a copy of A to every worker:
      do dest = 1, numworkers
         call MPI_SEND(a, nra*nca, mpi_double_precision, dest, mtype,
     &                 mpi_comm_world, ierr)
      end do
• Worker: receives a copy of A from the master:
      call mpi_recv(a, nra*nca, mpi_double_precision, master, mtype,
     &              mpi_comm_world, status, ierr)
• Master: distributes blocks of columns of B to the workers
• Master sends the column identifier (offset) and the number of columns (cols):
      do dest = 1, numworkers
         call MPI_SEND(offset, 1, mpi_integer, dest, mtype,
     &                 mpi_comm_world, ierr)
         call MPI_SEND(cols, 1, mpi_integer, dest, mtype,
     &                 mpi_comm_world, ierr)
      end do
• Master sends the corresponding values to the workers:
      do dest = 1, numworkers
         call MPI_SEND(b(1,offset), cols*nca, mpi_double_precision,
     &                 dest, mtype, mpi_comm_world, ierr)
      end do
• Workers receive the data:
      call MPI_RECV(offset, 1, mpi_integer, master, mtype,
     &              mpi_comm_world, status, ierr)
      call MPI_RECV(cols, 1, mpi_integer, master, mtype,
     &              mpi_comm_world, status, ierr)
      call MPI_RECV(b, cols*nca, mpi_double_precision, master, mtype,
     &              mpi_comm_world, status, ierr)
• Workers do the matrix multiply:
      do k = 1, cols
         do i = 1, nra
            c(i,k) = 0.0d0
            do j = 1, nca
               c(i,k) = c(i,k) + a(i,j) * b(j,k)
            end do
         end do
      end do
• Workers send the results for their block back to the master:
      call MPI_SEND(c, cols*nra, mpi_double_precision, master, mtype,
     &              mpi_comm_world, ierr)
• Master receives the results from the workers:
      do i = 1, numworkers
c -- offset and cols are those of the block assigned to worker i
         call MPI_RECV(c(1,offset), cols*nra, mpi_double_precision, i,
     &                 mtype, mpi_comm_world, status, ierr)
      end do
Remark: Fortran is not case sensitive.
IV. Collective Communications:
• Substitute for a more complex sequence of calls
• Involve all the processes in a process group
• Called by all processes in a communicator
• All routines block until they are locally complete
• Receive buffers must be exactly the right size
• No message tags are needed
• Collective calls are divided into three subsets:
    • synchronization
    • data movement
    • global computation
Barrier Synchronization Routines
• To synchronize all processes within a communicator
• A communicator is a group of processes and a context of communication
• The base group is the group that contains all processes, which is associated with the MPI_COMM_WORLD communicator.
• A node calling it will be blocked until all nodes within the group have called it.
      call MPI_BARRIER(comm, ierr)
Broadcast
• One processor sends some data to all processors in a group
      call MPI_BCAST(buffer, count, datatype, root, comm, ierr)
• MPI_BCAST must be called by each node in the group, specifying the same communicator and root. The message is sent from the root process to all processes in the group, including the root process.
Scatter
• Data are distributed into n equal segments, where the ith segment is sent to the ith process in the group, which has n processes.
      call MPI_SCATTER(sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm, ierr)
Gather
• Data are collected into a specified process in the order of process rank.
• Gather is the reverse process of scatter.
      call MPI_GATHER(sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm, ierr)
Example: the data in Proc. 0 are {1,2}, in Proc. 1 {3,4}, in Proc. 2 {5,6}, ..., in Proc. 5 {11,12}. Then
      real sbuf(2), rbuf(12)
      call MPI_GATHER(sbuf, 2, MPI_REAL, rbuf, 2, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
will bring {1,2,3,4,5,6,...,11,12} into rbuf on Proc. 3.
Similarly, the inverse transfer, with sbuf(12) holding the full data on Proc. 3 and rbuf(2) on every process, is:
      call MPI_SCATTER(sbuf, 2, MPI_REAL, rbuf, 2, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
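The three data-movement calls can be combined in one small sketch: the root broadcasts a scaling factor, scatters a vector in segments of two elements, each process scales its segment, and the root gathers the segments back in rank order. The program name scatter_gather, the variable names and the values are assumptions made for the example. For simplicity the full array is allocated on every process, although its contents only matter on the root.

```fortran
program scatter_gather
  implicit none
  include 'mpif.h'
  integer, parameter :: root = 0
  integer, parameter :: chunk = 2      ! elements per process (assumed)
  real, allocatable :: full(:)         ! contents significant on the root only
  real :: part(chunk), factor
  integer :: rank, nb_procs, ierr, i

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  allocate(full(chunk * nb_procs))
  full = 0.0
  factor = 0.0
  if (rank == root) then
     factor = 10.0
     do i = 1, chunk * nb_procs
        full(i) = real(i)
     end do
  end if

  ! The root sends the same factor to every process.
  call MPI_BCAST(factor, 1, MPI_REAL, root, MPI_COMM_WORLD, ierr)

  ! The root hands out consecutive segments of 'chunk' elements, in rank order.
  call MPI_SCATTER(full, chunk, MPI_REAL, part, chunk, MPI_REAL, &
                   root, MPI_COMM_WORLD, ierr)

  part = factor * part

  ! The segments come back to the root, again in rank order.
  call MPI_GATHER(part, chunk, MPI_REAL, full, chunk, MPI_REAL, &
                  root, MPI_COMM_WORLD, ierr)

  if (rank == root) print *, 'gathered:', full

  deallocate(full)
  call MPI_FINALIZE(ierr)
end program scatter_gather
```

Note that the counts passed to MPI_SCATTER and MPI_GATHER are per process (here 2), not the total length of the root's array.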
Two more MPI functions: MPI_Allgather and MPI_Alltoall
      MPI_Allgather(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
      MPI_Alltoall(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
sbuf: starting address of the send buffer
scount: number of elements sent to each process
stype: data type of the send buffer elements
rbuf: address of the receive buffer
rcount: number of elements received from any process
rtype: data type of the receive buffer elements
comm: communicator
To summarize:
[Figure: data movement between two processes p0 and p1 for Broadcast, Scatter, Gather, All Gather and All to All]
Global Reduction Routines
• The partial result in each process in the group is combined together using some desired function.
• The operation function passed to a global computation routine is either a predefined MPI function or a user-supplied function.
• Examples:
    • Global sum or product.
    • Global maximum or minimum.
    • Global user-defined operation.
• MPI_Reduce(sbuf, rbuf, count, stype, op, root, comm, ierr)
• MPI_Allreduce(sbuf, rbuf, count, stype, op, comm, ierr)
sbuf: address of the send buffer
rbuf: address of the receive buffer
count: the number of elements in the send buffer
stype: the data type of the elements of the send buffer
op: the reduce operation function, predefined or user-defined
root: the rank of the root process
comm: communicator
MPI_Reduce returns the result to a single process (the root).
MPI_Allreduce returns the result to all processes in the group.
Global Reduction Routines: Examples
[Figure: three processes p0, p1, p2 before and after MPI_Reduce(sendbuf, recvbuf, 4, MPI_INT, MPI_MAX, 0, comm); the element-wise maxima end up on p0]
[Figure: the same three processes before and after MPI_Allreduce(sendbuf, recvbuf, 4, MPI_INT, MPI_SUM, comm); the element-wise sums end up on every process]
c A subroutine that computes the dot product of two vectors that are
c distributed across a group of processes and returns the answer at
c node zero:
      subroutine PAR_BLAS1(N, a, b, scalar_product, comm)
      include 'mpif.h'
      integer N, comm, I, ierr
      real a(N), b(N), sum, scalar_product
      sum = 0.0
      do I = 1, N
         sum = sum + a(I) * b(I)
      end do
      call MPI_REDUCE(sum, scalar_product, 1, MPI_REAL, MPI_SUM, 0,
     &                comm, ierr)
      return
      end
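When every process needs the dot product, not just node zero, MPI_Allreduce can replace MPI_Reduce. The following is a minimal sketch under assumed data: each process holds a local slice of length n = 4, with a filled with ones and b filled with rank + 1. The program and variable names are hypothetical.

```fortran
program dot_allreduce
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 4     ! local slice length (assumed)
  real :: a(n), b(n), local_sum, global_sum
  integer :: rank, ierr, i

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Assumed test data: the local slices of the two distributed vectors.
  a = 1.0
  b = real(rank + 1)

  ! Partial dot product over the local slice.
  local_sum = 0.0
  do i = 1, n
     local_sum = local_sum + a(i) * b(i)
  end do

  ! Sum the partial results; every process receives the total.
  call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_REAL, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)

  print *, 'rank', rank, ': local =', local_sum, ' global =', global_sum

  call MPI_FINALIZE(ierr)
end program dot_allreduce
```

On 3 processes the local sums are 4, 8 and 12, and every rank reports a global value of 24. Compared with MPI_Reduce, the only change in the argument list is that the root argument disappears.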
Global Reduction Routines: Predefined Reduce Operations
MPI NAME    FUNCTION    MPI NAME    FUNCTION
MPI_MAX     Maximum     MPI_LOR     Logical OR
MPI_MIN     Minimum     MPI_LAND    Logical AND
MPI_SUM     Sum         MPI_PROD    Product
V. More on Communication Mode:
So far, we have seen the standard SEND and RECEIVE functions. However, we need to know more in order to overlap communication with computation and, more generally, to optimize the code.
Blocking Calls
• A blocking send or receive call suspends execution of the user's program until the message buffer being sent or received is safe to use.
• In the case of a blocking send, this means the data to be sent have been copied out of the send buffer, but they have not necessarily been received in the receiving task. The contents of the send buffer can be modified without affecting the message that was sent.
• A blocking receive implies that the data in the receive buffer are valid.
Blocking Communication Modes:
• Synchronous Send: MPI_SSEND: returns when the message buffer can be safely reused. The sending task tells the receiver that a message is ready for it and waits for the receiver to acknowledge.
    • System overhead: buffer to network and vice versa.
    • Synchronization overhead: handshake + waiting.
    • Safe and portable.
• Buffered Send: MPI_BSEND: returns when the message is copied to the system buffer.
• Standard Send: MPI_SEND: either synchronous or buffered, implemented by the vendor to give good performance for most programs.
    • In MPICH: we do have buffered send
Non-Blocking Calls
• Non-blocking calls return immediately after initiating the communication.
• In order to reuse the send message buffer, the programmer must check for its status.
• The programmer can choose to block before the message buffer is used or test for the status of the message buffer.
• A blocking or non-blocking send can be paired with a blocking or non-blocking receive.
• Syntax:
      call MPI_Isend(buf, count, datatype, dest, tag, comm, handle, ierr)
      call MPI_Irecv(buf, count, datatype, src, tag, comm, handle, ierr)
• The programmer can block or check for the status of the message buffer:
• MPI_Wait(request, status, ierr)
    • This routine blocks until the communication specified by the handle request has completed. It is useful when the data in the communication buffer are about to be reused.
• MPI_Test(request, flag, status, ierr)
    • This routine does not block. It queries completion of the communication specified by the handle request, and the result (True or False) is returned in flag. The request handle will have been returned by an earlier call to a non-blocking communication routine.
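A minimal sketch of how these calls fit together: ranks 0 and 1 exchange a buffer with MPI_Irecv and MPI_Isend, do unrelated computation while the transfer is in flight, then complete both requests. The program name overlap, the buffer size, the tag and the dummy computation are assumptions made for the example, which expects at least two processes (ranks above 1 stay idle).

```fortran
program overlap
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000, tag = 100   ! assumed size and tag
  real :: sendbuf(n), recvbuf(n), work
  integer :: rank, other, ierr, i
  integer :: req_send, req_recv
  integer :: status(MPI_STATUS_SIZE)
  logical :: done

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  sendbuf = real(rank)

  if (rank < 2) then
     other = 1 - rank
     ! Both calls return immediately; the buffers must not be touched yet.
     call MPI_IRECV(recvbuf, n, MPI_REAL, other, tag, MPI_COMM_WORLD, &
                    req_recv, ierr)
     call MPI_ISEND(sendbuf, n, MPI_REAL, other, tag, MPI_COMM_WORLD, &
                    req_send, ierr)

     ! Computation that uses neither buffer can overlap with the transfer.
     work = 0.0
     do i = 1, n
        work = work + sqrt(real(i))
     end do

     ! Non-blocking query: done is .true. only if the receive has completed.
     call MPI_TEST(req_recv, done, status, ierr)
     ! Blocking completion for whatever is still pending.
     if (.not. done) call MPI_WAIT(req_recv, status, ierr)
     call MPI_WAIT(req_send, status, ierr)

     print *, 'rank', rank, 'received', recvbuf(1), ' work =', work
  end if

  call MPI_FINALIZE(ierr)
end program overlap
```

Only after the wait on req_recv is recvbuf safe to read, and only after the wait on req_send is sendbuf safe to overwrite.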
Deadlock
- All tasks are waiting for events that haven't been initiated
- Common in SPMD programs with blocking communication, e.g. every task sends, but none receives
- Can also occur when insufficient system buffer space is available
- Remedies:
    - Arrange for one task to receive
    - Use MPI_Sendrecv
    - Use non-blocking communication
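The first remedy can be sketched for two tasks as follows: task 0 sends first and then receives, while task 1 receives first and then sends, so every blocking send always finds a matching receive. The program name ordered_exchange, the message length and the tag are assumptions made for this sketch.

```fortran
program ordered_exchange
  implicit none
  include 'mpif.h'
  integer, parameter :: msglen = 2048, tag = 100   ! assumed values
  real :: outmsg(msglen), inmsg(msglen)
  integer :: rank, other, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  outmsg = real(rank)

  if (rank == 0) then
     other = 1
     ! Task 0: send first, then receive.
     call MPI_SSEND(outmsg, msglen, MPI_REAL, other, tag, &
                    MPI_COMM_WORLD, ierr)
     call MPI_RECV(inmsg, msglen, MPI_REAL, other, tag, &
                   MPI_COMM_WORLD, status, ierr)
  else if (rank == 1) then
     other = 0
     ! Task 1: receive first, then send, which breaks the cycle.
     call MPI_RECV(inmsg, msglen, MPI_REAL, other, tag, &
                   MPI_COMM_WORLD, status, ierr)
     call MPI_SSEND(outmsg, msglen, MPI_REAL, other, tag, &
                    MPI_COMM_WORLD, ierr)
  end if

  if (rank < 2) print *, 'Task', rank, 'received', inmsg(1)

  call MPI_FINALIZE(ierr)
end program ordered_exchange
```

The price of this remedy is that the two transfers are serialized: the second message cannot start before the first one has been received.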
Examples : Deadlock
c Improper use of blocking calls results in deadlock, run on two nodes
c author : Roslyn Leibensperger, (CTC)
      program deadlock
      implicit none
      include 'mpif.h'
      integer MSGLEN, ITAG_A, ITAG_B
      parameter (MSGLEN = 2048, ITAG_A = 100, ITAG_B = 200)
      real rmsg1(MSGLEN), rmsg2(MSGLEN)
      integer irank, idest, isrc, istag, iretag
      integer istatus(MPI_STATUS_SIZE), ierr, I
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)
      do I = 1, MSGLEN
         rmsg1(I) = 100
         rmsg2(I) = -100
      end do
      if (irank .eq. 0) then
         idest = 1
         isrc = 1
         istag = ITAG_A
         iretag = ITAG_B
      else if (irank .eq. 1) then
         idest = 0
         isrc = 0
         istag = ITAG_B
         iretag = ITAG_A
      end if
      print *, ' Task ', irank, ' has sent the message '
      call MPI_Ssend(rmsg1, MSGLEN, MPI_REAL, idest, istag,
     &               MPI_COMM_WORLD, ierr)
      call MPI_Recv(rmsg2, MSGLEN, MPI_REAL, isrc, iretag,
     &              MPI_COMM_WORLD, istatus, ierr)
      print *, ' Task ', irank, ' has received the message '
      call MPI_Finalize(ierr)
      end
Examples : Deadlock (fixed)
c Solution program showing the use of a non-blocking send to eliminate deadlock
c author : Roslyn Leibensperger (CTC)
      program fixed
      implicit none
      include 'mpif.h'
c ----------------------------------------------
      print *, ' Task ', irank, ' has started the send '
      call MPI_Isend(rmsg1, MSGLEN, MPI_REAL, idest, istag,
     &               MPI_COMM_WORLD, irequest, ierr)
      call MPI_Recv(rmsg2, MSGLEN, MPI_REAL, isrc, iretag,
     &              MPI_COMM_WORLD, irstatus, ierr)
      call MPI_Wait(irequest, isstatus, ierr)
      print *, ' Task ', irank, ' has completed the send '
      call MPI_Finalize(ierr)
      end
Sendrecv
- Useful for executing a shift operation across a chain of processes.
- The system takes care of possible deadlock due to blocking calls.
      MPI_Sendrecv(sbuf, scount, stype, dest, stag, rbuf, rcount, rtype, source, rtag, comm, status, ierr)
- sbuf (rbuf): initial address of send (receive) buffer.
- scount (rcount): number of elements in send (receive) buffer.
- stype (rtype): type of elements in send (receive) buffer.
- stag (rtag): send (receive) tag.
- dest: rank of destination.
- source: rank of source.
- comm: communicator.
- status: status object.

program sendrecv
  implicit none
  include 'mpif.h'
  integer, dimension(MPI_STATUS_SIZE) :: status
  ! the tag value is arbitrary; 100 is used here
  integer, parameter :: tag=100
  integer :: rank, value, num_proc, code

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,code)

  ! we suppose that we have only two processes.
  num_proc=mod(rank+1,2)

  call MPI_SENDRECV(rank+1000,1,MPI_INTEGER,num_proc,tag, &
                    value,1,MPI_INTEGER,num_proc,tag, &
                    MPI_COMM_WORLD,status,code)

  print *,'me, process',rank,', i have received',value,'from process',num_proc
  call MPI_FINALIZE(code)
end program sendrecv

> mpirun -np 2 sendrecv
me, process 1 , i have received 1000 from process 0
me, process 0 , i have received 1001 from process 1
Remark: if a blocking MPI_SEND followed by MPI_RECV were used in this code instead, we could have a deadlock, because each process may wait for a matching receive that is never posted!
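The shift across a chain of processes mentioned above can be sketched for any number of processes by closing the chain into a ring: each process sends to its right neighbour and receives from its left neighbour in a single call. The program name ring_shift and the variable names left, right, sendval and recvval are assumptions made for the example.

```fortran
program ring_shift
  implicit none
  include 'mpif.h'
  integer, parameter :: tag = 100    ! arbitrary tag (assumed)
  integer :: status(MPI_STATUS_SIZE)
  integer :: rank, nb_procs, left, right, sendval, recvval, code

  call MPI_INIT(code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

  ! Neighbours on a ring: the last rank wraps around to rank 0.
  right = mod(rank + 1, nb_procs)
  left  = mod(rank - 1 + nb_procs, nb_procs)

  sendval = rank + 1000
  ! Send to the right and receive from the left in one deadlock-free call.
  call MPI_SENDRECV(sendval, 1, MPI_INTEGER, right, tag, &
                    recvval, 1, MPI_INTEGER, left,  tag, &
                    MPI_COMM_WORLD, status, code)

  print *, 'process', rank, 'received', recvval, 'from process', left

  call MPI_FINALIZE(code)
end program ring_shift
```

With two processes this reduces to the exchange shown above, since left and right are then the same neighbour.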
Optimizations
Optimization must be a main concern when communication time becomes a significant part of the total compared to computation time.
Optimization of communications may be accomplished at different levels; the main ones are:
1. Overlap communication with computation.
2. Avoid, if possible, copying the message into temporary memory (buffering).
3. Minimize the additional costs induced by calling communication subroutines too often.