QosCosGrid - Barcelona, 25 October 2006
MPICH-V Project
http://www.mpich-v.net
Pierre Lemarinier [email protected]
Laboratoire de Recherche en Informatique, University Paris South, INRIA Futurs
Message Passing Interface
Standard Description, Performance

Contents
Introduction to MPI
  Message passing
  Different types of communication
  MPI functionalities
MPI structures
  Basic functions
  Data types
  Contexts and tags
  Groups and communication domains
Communication functions
  Point-to-point communications
  Asynchronous communications
  Global communications
MPI-2
  One-sided communications
  I/O
  Dynamicity

Message passing (1)
Problem:
  We have N nodes
  All nodes are connected by a network
  How can the N nodes be used together as one global computer?
[Figure: N nodes, each with its own CPU and RAM, connected by a network]

Message passing (2)
One answer: message passing
  Execute one process per processor
  Exchange data explicitly between processors
  Synchronize the different processes explicitly
Two types of data transfer:
  Only one process initiates the communication: ‘one-sided’
  The two processes cooperate for the communication: ‘cooperative’

Two types of data transfer
‘One-sided’ communications
  No rendezvous protocol
  A process gets no warning about reads or writes performed in its local memory
  Costly synchronization
  Function prototypes: put(remote_process, data), get(remote_process, data)
Cooperative communications
  The communication involves the two processes
  Implicit synchronization in the simple case
  Function prototypes: send(destination, data), recv(source, data)
[Figure: two CPUs linked by put(), two CPUs linked by get(), two CPUs linked by a send()/recv() pair]

MPI (Message Passing Interface)
Standard developed by academic and industrial partners
Objective: to specify a portable message passing library
Implies an execution environment for launching and connecting together all the processes
Allows:
  Synchronous and asynchronous communications
  Global communications
  Separated communication domains

MPI Programming Structure
Follows the SPMD programming model
  All processes are launched at the same time
  Same program for every processor
  Processor roles can be differentiated by a rank number
Structure of a program:
  Sequential section (non-parallel section)
  MPI initialization (parallel section initialization)
  Multinode parallel section (MPI): parallel initialization, computation, communications, synchronization
  End of parallel section (parallel section termination)
  Sequential section (non-parallel section)
    Remark: most implementations advise limiting this part of the program to the exit call

Basic functions
MPI environment initialization
  C: MPI_Init(&argc, &argv);
  Fortran: call MPI_Init(ierror)
MPI environment termination (programs are advised to exit after this call)
  C: MPI_Finalize();
  Fortran: call MPI_Finalize(ierror)
Getting the process rank
  C: MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  Fortran: call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
Getting the total number of processes
  C: MPI_Comm_size(MPI_COMM_WORLD, &size);
  Fortran: call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)

HelloWorld_MPI.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rang, nprocs;   /* rang: rank of this process */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rang);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("hello, I am %d (of %d processes)\n", rang, nprocs);
    MPI_Finalize();
    return 0;
}

MPI data types

MPI type              C type
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              (none)
MPI_PACKED            (none)

MPI type              Fortran type
MPI_INTEGER           INTEGER
MPI_REAL              REAL
MPI_DOUBLE_PRECISION  DOUBLE PRECISION
MPI_COMPLEX           COMPLEX
MPI_LOGICAL           LOGICAL
MPI_CHARACTER         CHARACTER(1)
MPI_BYTE              (none)
MPI_PACKED            (none)

User data types
By default, MPI exchanges data as vectors of MPI basic data types
It is possible to create data types to simplify communication operations (they spare manual buffering and linearization)
User data types make explicit packing with MPI_PACKED largely unnecessary
A user type consists of a sequence of basic types and a sequence of offsets describing the memory layout
  Creation: build the type with a type constructor, then commit it with MPI_Type_commit(&type);
  Destruction: MPI_Type_free(&type);
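
The "sequence of basic types and sequence of offsets" can be made concrete with a small sketch in plain C. The struct particle and the function particle_layout are hypothetical names chosen for illustration; they are not part of MPI. In a real program, arrays like these would typically be handed to a type constructor such as MPI_Type_create_struct, and the resulting type committed with MPI_Type_commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user structure that we would like to send as one MPI element. */
struct particle {
    int    id;
    double position[3];
    char   tag;
};

enum { PARTICLE_FIELDS = 3 };

/*
 * Fills the block lengths and byte offsets that describe struct particle:
 * one entry per field, in declaration order.
 */
void particle_layout(int counts[PARTICLE_FIELDS], size_t offsets[PARTICLE_FIELDS])
{
    assert(counts != NULL && offsets != NULL);
    counts[0] = 1; offsets[0] = offsetof(struct particle, id);
    counts[1] = 3; offsets[1] = offsetof(struct particle, position);
    counts[2] = 1; offsets[2] = offsetof(struct particle, tag);
}
```

Using offsetof rather than hand-computed byte counts keeps the description correct whatever padding the compiler inserts between fields.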

Contexts and tags
Messages need to be distinguished from one another on reception
Contexts make it possible to distinguish a point-to-point communication from a global communication
  Every message is sent within a context and must be received in the same context
  Contexts are managed automatically by MPI
Communication tags identify one communication among several
  When communications are asynchronous, tags allow them to be sorted
  A receive operation can accept the next message, whatever its tag, by specifying MPI_ANY_TAG
  Tag management is up to the MPI programmer

Communication domains
Nodes can be grouped in a communication domain called a communicator
Every process has one rank number per group it belongs to
MPI_COMM_WORLD is the default communication domain; it gathers all processes and is created at initialization
More generally, every operation applies to a single set of processes, specified by its communicator
Each domain constitutes a distinct context for communications

Split a communicator (1/2): groups
To create a new domain, first create a new group of processes:
  int MPI_Comm_group(MPI_Comm comm, MPI_Group *group);
  int MPI_Group_incl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup);
  int MPI_Group_excl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup);
Set operations on groups:
  int MPI_Group_union(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
  int MPI_Group_intersection(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
  int MPI_Group_difference(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
Destruction of a group:
  int MPI_Group_free(MPI_Group *group);

Split a communicator (2/2): communicators
Associating a communicator with a group:
  int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm);
Dividing a domain into sub-domains:
  int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm);
  MPI_Comm_split is a collective operation on the initial communicator comm
  Every process gives its color; all processes of the same color end up in the same newcomm
  The MPI_UNDEFINED color lets a process stay out of the new communicators
  Every process gives its key; processes of the same color are ranked by these keys
  A group is implicitly created for each communicator created this way
Destruction of a communicator:
  int MPI_Comm_free(MPI_Comm *comm);
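
The color and key rule of MPI_Comm_split can be illustrated without running MPI. The sketch below is plain C that only simulates the ranking rule; the function split_rank and the constant NO_COLOR (standing in for MPI_UNDEFINED) are hypothetical names, not MPI calls. It assumes that ties between equal keys are broken by the rank in the original communicator, which is what the MPI standard specifies.

```c
#include <assert.h>

/* Hypothetical stand-in for MPI_UNDEFINED in this simulation. */
#define NO_COLOR (-1)

/*
 * Simulates the rank that process `me` would get in its new communicator.
 * colors[i] and keys[i] are the values given by the process of rank i in the
 * original communicator of size np.
 * Returns -1 if `me` passed NO_COLOR (it joins no new communicator).
 */
int split_rank(const int *colors, const int *keys, int np, int me)
{
    assert(np > 0 && me >= 0 && me < np);
    if (colors[me] == NO_COLOR)
        return -1;

    int rank = 0;
    for (int i = 0; i < np; i++) {
        if (i == me || colors[i] != colors[me])
            continue;   /* only processes of the same color are ranked together */
        /* process i precedes `me` if its key is smaller,
           or if the keys are equal and its original rank is smaller */
        if (keys[i] < keys[me] || (keys[i] == keys[me] && i < me))
            rank++;
    }
    return rank;
}
```

For example, with colors {0, 1, 0, 1, NO_COLOR} and keys {5, 0, 1, 0, 0}, original ranks 0 and 2 share one new communicator and ranks 1 and 3 share another, while rank 4 belongs to none.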

Point-to-point communications
Send and receive data between a pair of processes
Both processes take part in the communication: one sends the data, the other asks for the reception
Communications are identified by tags
The type and the size of the data must be specified

Basic communication functions
Blocking send (the call returns once the send buffer can be reused by the computation):
  int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
  The tag identifies the message
Blocking receive:
  int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
  The tag must match the tag of the message sent
  MPI_ANY_SOURCE can be specified to receive from any process
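
Jeton.c
#include <stdio.h>
#include <mpi.h>

/* A token (jeton) travels forever around a ring of processes. */
int main(int argc, char **argv) {
    int me, prec, suiv, np;   /* prec: predecessor, suiv: successor on the ring */
    int jeton = 0;
    MPI_Status status;        /* a status object, not an uninitialized pointer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (me == 0)
        prec = np - 1;
    else
        prec = me - 1;
    if (me == np - 1)
        suiv = 0;
    else
        suiv = me + 1;

    /* Process 0 injects the token into the ring */
    if (me == 0)
        MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);

    /* Endless loop: receive the token from the predecessor, pass it on */
    while (1) {
        MPI_Recv(&jeton, 1, MPI_INT, prec, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();   /* never reached: the loop above does not terminate */
    return 0;
}
[Figure: processes 0, 1, 2, 3, 4, 5, ..., np-1 arranged in a ring]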
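
The predecessor and successor on the ring can also be computed with modular arithmetic instead of the two if/else tests. The sketch below is plain C with no MPI call; ring_prev and ring_next are hypothetical helper names introduced only for this illustration.

```c
#include <assert.h>

/* Rank of the predecessor of `me` on a ring of np processes. */
int ring_prev(int me, int np)
{
    assert(np > 0 && me >= 0 && me < np);
    return (me + np - 1) % np;   /* adding np keeps the operand non-negative for rank 0 */
}

/* Rank of the successor of `me` on a ring of np processes. */
int ring_next(int me, int np)
{
    assert(np > 0 && me >= 0 && me < np);
    return (me + 1) % np;
}
```

With np = 4, process 0 receives from process 3 and sends to process 1, and process 3 sends back to process 0, which closes the ring.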

Synchronism and asynchronism (1)
To resolve some deadlocks, and to allow communications to overlap with computation, one can use non-blocking functions
In this case, the communication scheme is the following:
  Initialization of the non-blocking communication (by one or both of the processes)
  The other process calls its communication function (non-blocking or blocking)
  ... computation ...
  Termination of the communication (blocking operation until the communication is performed)

Synchronism and asynchronism (2)
Non-blocking functions:
  int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);
  int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);
The request field is used to know the state of a non-blocking communication. To wait for its termination, one can call the following function:
  int MPI_Wait(MPI_Request *request, MPI_Status *status);

Synchronism and asynchronism (3)
Data can be exchanged by blocking or non-blocking functions. Several functions control how the send and the receive operations are coupled
The communication mode is selected by a prefix (MPI_[*]send):
  Synchronous send ([S], MPI_Ssend): completes when the corresponding receive is posted (tightly coupled to the reception, no buffering)
  Buffered send ([B], MPI_Bsend): a buffer is created; the send completes when the user buffer has been copied to the system buffer (not coupled to the reception)
  Standard send (no prefix, MPI_Send): completes when the send buffer can be reused (the MPI implementation decides between buffering and coupling to the reception)
  Ready send ([R], MPI_Rsend): the user guarantees that the receive is already posted when calling this function (coupled to the reception, no buffering)

Collective or global operations
To simplify communication operations involving multiple processes, one can use collective operations on a communicator
Typical operations:
  Reductions
  Data exchange: broadcast, scatter, gather, all-to-all
  Explicit synchronization

Reductions (1)
A reduction is an arithmetic operation applied by a set of processors to distributed data
Prototype:
  C: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator);
  Fortran: call MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, communicator, ierror)
With MPI_Reduce(), only the root process gets the result
With MPI_Allreduce(), all processes get the result

Reductions (2)
Available operations

MPI_Op        Operation
MPI_MIN       Minimum
MPI_MAX       Maximum
MPI_SUM       Sum
MPI_PROD      Element-by-element product
MPI_LAND      Logical and
MPI_BAND      Bitwise and
MPI_LOR       Logical or
MPI_BOR       Bitwise or
MPI_LXOR      Logical exclusive or
MPI_BXOR      Bitwise exclusive or
MPI_MINLOC    Minimum and its location
MPI_MAXLOC    Maximum and its location
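
To make the meaning of two of these operations concrete, here is a plain C sketch that computes what a sum reduction and a maximum-with-location reduction would deliver, given one contribution per rank. No MPI call is made; reduce_sum, reduce_maxloc and struct maxloc are hypothetical names for this illustration. When several ranks hold the maximum, the sketch reports the lowest rank, which matches the rule the MPI standard gives for MPI_MAXLOC.

```c
#include <assert.h>

/* Result of a MAXLOC-style reduction: the maximum and the rank that holds it. */
struct maxloc {
    double value;
    int    rank;
};

/* What a sum reduction delivers: the sum of the contributions of all ranks. */
double reduce_sum(const double *contrib, int np)
{
    assert(np > 0);
    double sum = 0.0;
    for (int r = 0; r < np; r++)
        sum += contrib[r];
    return sum;
}

/* What a maximum-with-location reduction delivers. */
struct maxloc reduce_maxloc(const double *contrib, int np)
{
    assert(np > 0);
    struct maxloc best = { contrib[0], 0 };
    for (int r = 1; r < np; r++) {
        /* strict comparison: on a tie the lowest rank is kept */
        if (contrib[r] > best.value) {
            best.value = contrib[r];
            best.rank = r;
        }
    }
    return best;
}
```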

Broadcast
A broadcast operation distributes the same data to all processes
One-to-all communication, from a specified ‘root’ process to all processes of a communicator
Prototypes:
  C: int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
  Fortran: call MPI_Bcast(buffer, count, datatype, root, communicator, ierror)
[Figure: with root = 1, the buffer of process 1 is copied to processes 0, 1, 2, 3, ..., np-1]

Scatter
One-to-all operation: different data are sent to each receiving process, according to its rank
Prototypes:
  C: int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator);
  Fortran: call MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
The ‘send’ parameters are used only by the sending (root) process
[Figure: with root = 2, sendbuf on process 2 is split into pieces delivered to recvbuf on processes 0, 1, 2, 3, ..., np-1]

Gather
All-to-one operation: different data are received by one receiving process
Prototypes:
  C: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator);
  Fortran: call MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
The ‘receive’ parameters are used only by the receiving (root) process
[Figure: with root = 3, sendbuf on processes 0, 1, 2, 3, ..., np-1 is collected into recvbuf on process 3]

All-to-All
All-to-all operation: every process sends different data to each process, according to its rank
Prototypes (there is no root parameter, since every process both sends and receives):
  C: int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm communicator);
  Fortran: call MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, communicator, ierror)
[Figure: sendbuf on processes 0, 1, 2, 3, ..., np-1 exchanged with recvbuf on processes 0, 1, 2, 3, ..., np-1]

Explicit Synchronization
Synchronization barrier: all processes of a communicator wait for the last process to enter the barrier before continuing their execution
On computers with a hardware barrier (such as SGI and Cray T3E), the MPI barrier is slower than the hardware barrier
Prototype:
  C: int MPI_Barrier(MPI_Comm communicator);
  Fortran: call MPI_Barrier(communicator, ierror)

Matrix trace (1)
Computing the trace of a matrix A of order n
The trace of a (square) matrix is the sum of its diagonal elements:
$$\mathrm{Trace}(A) = \sum_{k=1}^{n} A_{k,k}$$
The sum can easily be split across multiple processors, ending with a reduction that computes the complete trace

Matrix trace (2)
#include <stdio.h>
#include <mpi.h>

#define N 8   /* assumed for illustration; N must be a multiple of np (N = m*np) */

int main(int argc, char **argv) {
    int me, np, root = 0;
    int i, tranche;               /* tranche: slice of the diagonal per process */
    static double A[N][N];        /* static: zero-initialized */
    double buffer[N], diag[N];
    double traceA, trace_loc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    tranche = N / np;

    /* Initialization of A, done by the root process */
    /* … */

    /* Buffering the diagonal elements on the root process */
    if (me == root) {
        for (i = 0; i < N; i++)
            buffer[i] = A[i][i];
    }

    /* The scatter operation distributes the buffered elements among the processes */
    MPI_Scatter(buffer, tranche, MPI_DOUBLE,
                diag, tranche, MPI_DOUBLE, root, MPI_COMM_WORLD);

    /* Each process computes its partial trace */
    trace_loc = 0;
    for (i = 0; i < tranche; i++)
        trace_loc += diag[i];

    /* Then we do the reduction */
    MPI_Reduce(&trace_loc, &traceA, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

    if (me == root)
        printf("The trace of A is: %f\n", traceA);

    MPI_Finalize();
    return 0;
}
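
The decomposition used by the trace program can be checked in plain C, without MPI: each process sums its own slice of the diagonal, and adding the partial sums gives the full trace. The functions partial_trace and trace_by_slices are hypothetical names for this sketch, which assumes, like the program, that the matrix order is a multiple of the number of processes.

```c
#include <assert.h>

/* Partial trace computed by process `me`: the sum of its slice of the diagonal. */
double partial_trace(const double *diag, int n, int np, int me)
{
    assert(np > 0 && n % np == 0 && me >= 0 && me < np);
    int slice = n / np;   /* number of diagonal elements per process */
    double sum = 0.0;
    for (int i = me * slice; i < (me + 1) * slice; i++)
        sum += diag[i];
    return sum;
}

/* Plays the role of the final sum reduction: adds up all partial traces. */
double trace_by_slices(const double *diag, int n, int np)
{
    double total = 0.0;
    for (int me = 0; me < np; me++)
        total += partial_trace(diag, n, np, me);
    return total;
}
```

For a diagonal {1, 2, 3, 4, 5, 6} split over 3 processes, the partial traces are 3, 7 and 11, and their sum is the trace 21.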

One-sided communications (1/2)
No synchronization during communications
Allows the implementation of a simulated shared memory (Remote Memory Access)
Defining the part of memory that other processes can access:
  MPI_Win_create()
  MPI_Win_free()
One-sided communication functions:
  MPI_Put()
  MPI_Get()
  MPI_Accumulate()
    Operations: MPI_SUM, MPI_LAND, MPI_REPLACE

One-sided communications (2/2)
Active synchronization function: MPI_Win_fence()
  Takes a memory window win as parameter
  Collective operation (barrier) on all processes of the group of win (the group returned by MPI_Win_get_group())
  Acts as a synchronization barrier which completes every RMA transfer using the window win
Passive synchronization functions: MPI_Win_lock() and MPI_Win_unlock()
  Behave like classical mutex functions
  The initiator of the communication is solely responsible for the synchronization
  When MPI_Win_unlock() returns, every transfer operation is finished

Parallel Input/Output
Intelligent management of I/O is mandatory for parallel applications
MPI-IO is a set of functions for optimised I/O
It extends classical file access functions:
  Collective synchronization for accessing files
  Shared or individual file offsets
  Blocking or non-blocking reads
  Views (for accessing non-sequential memory zones)
  Syntax similar to that of MPI communication functions

Dynamic allocation of processes
Dynamic change of the number of processes
  New processes can be spawned during execution
  The MPI_Comm_spawn() function creates a new set of processes on other processors
  An inter-communicator links the domain of the parent to the new domain gathering the new processes
  The MPI_Intercomm_merge() function builds a single communicator from an inter-communicator
  MPI-2 allows a dynamic MPMD style through the function MPI_Comm_spawn_multiple()
  MPI_Comm_get_attr() with the MPI_UNIVERSE_SIZE key gives the maximum possible number of MPI processes
Process destruction
  There is no explicit exit() function for an MPI process
  For an MPI process to exit, its MPI_COMM_WORLD communicator must contain only finalizing processes
  All inter-communicators must be closed before finalization

Remarks and conclusion
MPI has become, thanks to the distributed computing community, a standard library for message passing
MPI-2 breaks the classic SPMD message passing model of MPI-1
Numerous implementations exist, on most architectures
A lot of documentation and many publications are available

Some pointers
MPI standard official site: http://www-unix.mcs.anl.gov/mpi/
The MPI forum: http://www.mpi-forum.org/
Book: MPI, The Complete Reference (Marc Snir et al.): http://www.netlib.org/utk/papers/mpi-book/mpi-book.html

Message Passing Interface
Performance

Contents
MPI implementation
Performance metrics
High performance networks
Communication type / 0-copy

MPI implementation
LAM-MPI
  Optimised for collective operations
MPICH
  New low-level drivers are easy to write
Open-MPI
  Tries to combine the performance and the ease of the two previous ones
  Conforms to MPI-2
IBM / NEC / FUJITSU…
  Complete and efficient implementations of MPI-2
  Target specific architectures

Performance metrics
Comparison criteria
  Latency, bandwidth
  Collective operations
  Overlapping capabilities
  Real applications
Measuring tools
  Round Trip Time (ping-pong)
  NetPipe
  NAS benchmarks: CG, LU, BT, FT

High performance networks (1/3)
Technologies
Myrinet
  Connectionless reliable API
  Registered buffers
  Fully programmable DMA NIC processor
  Up to 2 Gb/s full-duplex bandwidth with Myrinet 2000
SCINet
  Torus topology based network with static routing
  No need to register buffers
  Very small latency (suitable for RMA)
  Up to 2 Gb/s
Gigabit Ethernet
  No need to register buffers
  DMA operations
  High latency
  Up to 1 Gb/s and 10 Gb/s bandwidth
Infiniband
  Reliable Connection mode and Unreliable Datagram mode
  Registered buffers
  Queued DMA operations
  Up to 10 Gb/s bandwidth

High performance networks (2/3)
Technologies
Myrinet
  Socket-GM
  MPICH-GM
SCINet
  No functional socket API
  SCI-MPICH
Gigabit Ethernet
  Has to use the socket interface
Infiniband
  IoIP
  LAM-MPI, MPICH, MPI/pro etc…

High performance networks (3/3)
Technologies

Eager vs Rendez-vous (1/2)
Eager protocol
  The message is sent without waiting for the receiver
    Better latency
  It is copied into a buffer if the receiver has not posted the receive yet
    Memory consuming for long messages
  Used only for short messages (<64KB)
Rendez-vous protocol
  Sender and receiver are synchronized
    High latency
  0-copy
    Better bandwidth
    Reduces memory consumption
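
The choice between the two protocols is usually a simple test on the message size. The sketch below, in plain C, illustrates that idea only: the names choose_protocol and EAGER_LIMIT_BYTES are hypothetical, the 64 KB threshold is the figure quoted above, and real MPI implementations use their own, often configurable, limits.

```c
#include <assert.h>
#include <stddef.h>

enum protocol { EAGER, RENDEZVOUS };

/* Threshold assumed from the 64 KB figure above; implementations differ. */
#define EAGER_LIMIT_BYTES ((size_t)64 * 1024)

/*
 * Messages strictly below the limit are sent eagerly (low latency, possible
 * extra copy on the receiver); larger ones wait for the receiver
 * (synchronization cost, but no intermediate copy).
 */
enum protocol choose_protocol(size_t message_bytes, size_t eager_limit)
{
    assert(eager_limit > 0);
    return message_bytes < eager_limit ? EAGER : RENDEZVOUS;
}
```

A 1 KB message would go through the eager path, while a message of exactly 64 KB or more would use the rendez-vous path.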

Eager vs Rendez-vous (2/2)

Communication types

High performance networks and 0-copy
Myrinet latency: 8 µs
MPICH-GM latency: 33 µs
MPICH-Vdummy latency: 94 µs

Conclusion
Many MPI implementations with similar performance
Multiple measurement criteria and multiple tools
  Latency, bandwidth
  Benchmarks and microbenchmarks
  Real applications
High performance networks make small performance details matter
  Network bandwidth equals the memory bandwidth
  Latency is smaller than some OS operations
  Performance relies on good programming
Performance results can vary a lot according to the type of communication employed
  Asynchronism is mandatory
  Bad programming results in bad performance
  0-copy can be mandatory