
QosCosGrid - Barcelona, 25 October 2006

MPICH-V Project

http://www.mpich-v.net

Pierre Lemarinier, [email protected]
Laboratoire de Recherche en Informatique, University Paris South, INRIA Futurs

Message Passing Interface

Standard Description


Contents
Introduction to MPI
  Message passing
  Different types of communication
  MPI functionalities
MPI structures
  Basic functions
  Data types
  Contexts and tags
  Groups and communication domains
Communication functions
  Point-to-point communications
  Asynchronous communications
  Global communications
MPI-2
  One-sided communications
  I/O
  Dynamicity


Message passing (1)
Problem:
  We have N nodes
  All nodes are connected by a network
How can the N nodes be used together as one global computer?

[Figure: several nodes, each with its own CPU and RAM, connected by a network]


Message passing (2)
One answer: message passing
  Execute one process per processor
  Exchange data explicitly between processors
  Synchronize the different processes explicitly
Two types of data transfer:
  Only one process initiates the communication: ‘one-sided’
  The two processes cooperate for the communication: ‘cooperative’


Two types of data transfer
‘One-sided’ communications
  No rendez-vous protocol
  A process gets no warning when its local memory is read or written
  Costly synchronization
  Function prototypes: put(remote_process, data), get(remote_process, data)
Cooperative communications
  The communication involves both processes
  Implicit synchronization in the simple case
  Function prototypes: send(destination, data), recv(source, data)

[Figure: three pairs of CPUs illustrating put(), get(), and send()/recv()]


MPI (Message Passing Interface)
A standard developed by academic and industrial partners
Objective: to specify a portable message passing library
Implies an execution environment for launching and connecting all the processes
Allows:
  Synchronous and asynchronous communications
  Global communications
  Separate communication domains




MPI structures
  Basic functions (example: HelloWorld_MPI.c)
  Data types
  Contexts and tags
  Groups and communication domains




MPI programming structure
Follows the SPMD programming model
  All processes are launched at the same time
  The same program runs on every processor
  Processor roles can be differentiated by a rank number

Program structure:
  Sequential section (non-parallel section)
  MPI initialization (parallel section initialization)
  Multinode parallel section (MPI): parallel initialization, computation, communications, synchronization
  End of parallel section (parallel section termination)
  Sequential section (non-parallel section)
    Remark: most implementations advise limiting this part of the program to the exit call


Basic functions
MPI environment initialization
  C: MPI_Init(&argc, &argv);
  Fortran: call MPI_Init(ierror)
MPI environment termination (programs are advised to exit after this call)
  C: MPI_Finalize();
  Fortran: call MPI_Finalize(ierror)
Getting the process rank
  C: MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  Fortran: call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
Getting the total number of processes
  C: MPI_Comm_size(MPI_COMM_WORLD, &size);
  Fortran: call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)


HelloWorld_MPI.c

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rang, nprocs;   /* rank of this process, total number of processes */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rang);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        printf("hello, I am %d (of %d processes)\n", rang, nprocs);
        MPI_Finalize();
        return 0;
    }


MPI data types

  C type                MPI type
  signed char           MPI_CHAR
  signed short int      MPI_SHORT
  signed int            MPI_INT
  signed long int       MPI_LONG
  unsigned char         MPI_UNSIGNED_CHAR
  unsigned short int    MPI_UNSIGNED_SHORT
  unsigned int          MPI_UNSIGNED
  unsigned long int     MPI_UNSIGNED_LONG
  float                 MPI_FLOAT
  double                MPI_DOUBLE
  long double           MPI_LONG_DOUBLE
  (none)                MPI_BYTE
  (none)                MPI_PACKED

  Fortran type          MPI type
  INTEGER               MPI_INTEGER
  REAL                  MPI_REAL
  DOUBLE PRECISION      MPI_DOUBLE_PRECISION
  COMPLEX               MPI_COMPLEX
  LOGICAL               MPI_LOGICAL
  CHARACTER(1)          MPI_CHARACTER
  (none)                MPI_BYTE
  (none)                MPI_PACKED


User data types
By default, MPI exchanges data as vectors of basic MPI data types
Data types can be created to simplify communication operations (they simplify buffering and linearization)
User data types replace the older explicit packing mechanism (MPI_PACKED)
A user type consists of a sequence of basic types and a sequence of offsets describing the memory layout
  Creation: MPI_Type_commit(&type);
  Destruction: MPI_Type_free(&type);
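To make the "sequence of basic types plus sequence of offsets" idea concrete, here is a minimal sketch in plain C that computes such a description for a record. The struct particle type and the particle_layout() helper are hypothetical names introduced for this illustration only, and no MPI call is made, so the snippet compiles without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record, used only for this illustration. */
struct particle {
    int    id;
    double pos[3];
    char   kind;
};

#define PARTICLE_FIELDS 3

/* Fill two parallel arrays describing the layout of struct particle:
 * how many basic elements each field holds, and the byte offset at
 * which the field starts inside the record. */
static void particle_layout(int blocklengths[PARTICLE_FIELDS],
                            size_t offsets[PARTICLE_FIELDS])
{
    assert(blocklengths != NULL && offsets != NULL);
    blocklengths[0] = 1; offsets[0] = offsetof(struct particle, id);
    blocklengths[1] = 3; offsets[1] = offsetof(struct particle, pos);
    blocklengths[2] = 1; offsets[2] = offsetof(struct particle, kind);
}
```

In a real MPI program these arrays, together with the matching basic types (MPI_INT, MPI_DOUBLE, MPI_CHAR), are the kind of information passed to MPI_Type_create_struct() before calling MPI_Type_commit(). The offsets would then be stored as MPI_Aint values.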


Contexts and tags
Different messages must be distinguished at reception
Contexts distinguish point-to-point communications from global communications
  Every message is sent within a context and must be received in the same context
  Contexts are managed automatically by MPI
Communication tags identify one communication among several
  When communications are asynchronous, tags allow them to be sorted
  A receive operation can accept the next message, whatever its tag, by specifying MPI_ANY_TAG
  Tag management is up to the MPI programmer


Communication domains
Nodes can be grouped in a communication domain called a communicator
Every process has one rank number per group it belongs to
MPI_COMM_WORLD is the default communication domain; it gathers all processes and is created at initialization
More generally, every operation applies to a single set of processes, specified by its communicator
Each domain constitutes a distinct context for communications


Split a communicator (1/2): groups
To create a new domain, first create a new group of processes:
    int MPI_Comm_group(MPI_Comm comm, MPI_Group *group);
    int MPI_Group_incl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup);
    int MPI_Group_excl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup);
Set operations on groups:
    int MPI_Group_union(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
    int MPI_Group_intersection(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
    int MPI_Group_difference(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
Destruction of a group:
    int MPI_Group_free(MPI_Group *group);


Split a communicator (2/2): communicators
Associating a communicator with a group:
    int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm);
Dividing a domain into sub-domains:
    int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm);
  MPI_Comm_split is a collective operation on the initial communicator comm
  Every process gives its color; all processes of the same color end up in the same newcomm
  The MPI_UNDEFINED color allows a process to stay out of any new communicator
  Every process gives its key; processes of the same color are ranked by these keys
  A group is implicitly created for each new communicator created this way
Communicator destruction:
    int MPI_Comm_free(MPI_Comm *comm);
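The color and key rules can be checked on paper with a small simulation. The sketch below is plain C with no MPI call. It computes the rank a process would receive after a split, assuming the tie-breaking rule of the MPI standard (equal keys are ordered by rank in the original communicator). The names split_rank() and COLOR_UNDEFINED are hypothetical and exist only in this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define COLOR_UNDEFINED (-1)  /* stand-in for MPI_UNDEFINED in this sketch */

/* Rank that process `me` would get in its new communicator, given the
 * color and key supplied by each of the np processes.
 * Returns -1 when `me` supplied COLOR_UNDEFINED. */
static int split_rank(const int *colors, const int *keys, int np, int me)
{
    int new_rank = 0;

    assert(colors != NULL && keys != NULL && me >= 0 && me < np);
    if (colors[me] == COLOR_UNDEFINED)
        return -1;
    for (int p = 0; p < np; p++) {
        if (p == me || colors[p] != colors[me])
            continue;                 /* other color: different communicator */
        /* p precedes me if its key is smaller, or on equal keys
         * if its rank in the original communicator is smaller */
        if (keys[p] < keys[me] || (keys[p] == keys[me] && p < me))
            new_rank++;
    }
    return new_rank;
}
```

For example, with colors {0, 1, 0, 1} and keys all equal to 0, processes 0 and 2 form one communicator with ranks 0 and 1, and processes 1 and 3 form another with ranks 0 and 1.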





Communication functions
  Point-to-point communications (example: Jeton.c)
  Asynchronous communications
  Global communications (example: trace.c)



Point-to-point communications
Send and receive data between a pair of processes
Both processes take part in the communication: one sends the data, the other posts the reception
Communications are identified by tags
The type and the size of the data must be specified


Basic communication functions
Blocking send (synchronous between the computation of the calling process and the action of sending):
    int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
  The tag identifies the message
Blocking data reception:
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
  The tag must be identical to the tag of the message sent
  MPI_ANY_SOURCE can be specified to receive from any process


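Jeton.c

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int me, prec, suiv, np;   /* my rank, predecessor, successor, process count */
        int jeton = 0;            /* the token */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        if (me == 0)
            prec = np - 1;
        else
            prec = me - 1;
        if (me == np - 1)
            suiv = 0;
        else
            suiv = me + 1;

        /* Process 0 injects the token into the ring */
        if (me == 0)
            MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);

        /* The token then circulates forever:
           receive it from the predecessor, pass it to the successor */
        while (1) {
            MPI_Recv(&jeton, 1, MPI_INT, prec, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();   /* never reached because of the endless loop above */
        return 0;
    }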

[Figure: processes 0, 1, 2, 3, 4, 5, ..., np-1 arranged in a ring]
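The two if/else tests that compute the predecessor and the successor in the ring can also be written with modular arithmetic. The following is a small sketch in plain C. The helper names ring_prev() and ring_next() are hypothetical and are not part of Jeton.c.

```c
#include <assert.h>

/* Rank of the predecessor of `me` in a ring of np processes. */
static int ring_prev(int me, int np)
{
    assert(np > 0 && me >= 0 && me < np);
    return (me + np - 1) % np;   /* adding np avoids a negative operand */
}

/* Rank of the successor of `me` in a ring of np processes. */
static int ring_next(int me, int np)
{
    assert(np > 0 && me >= 0 && me < np);
    return (me + 1) % np;
}
```

With np = 4, rank 0 receives from rank 3 and sends to rank 1, while rank 3 sends back to rank 0, which closes the ring.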


Synchronism and asynchronism (1)
To solve some deadlocks, and to allow communications to be overlapped with computation, one can use non-blocking functions
In this case, the communication scheme is the following:
  Initialization of the non-blocking communication (by one or both of the processes)
  The other process calls its communication function (non-blocking or blocking)
  ... computation
  Termination of the communication (a blocking operation that waits until the communication is performed)


Synchronism and asynchronism (2)
Non-blocking functions:
    int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);
    int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);
The request field is used to know the state of a non-blocking communication. To wait for its termination, one can call the following function:
    int MPI_Wait(MPI_Request *request, MPI_Status *status);


Synchronism and asynchronism (3)
Data can be exchanged by blocking or non-blocking functions. Several functions control how the send and the receive operations are coupled
The communication mode is selected by a prefix (MPI_[*]send):
  Synchronous send ([S]): finishes when the corresponding receive is posted (tightly coupled to the reception, no buffering)
  Buffered send ([B]): a buffer is created; the send ends when the user buffer has been copied to the system buffer (not coupled to the reception)
  Standard send (no prefix): the send ends when the send buffer can be reused (the MPI implementation decides between buffering and coupling to the reception)
  Ready send ([R]): the user guarantees that the receive request is already posted when this function is called (coupled to the reception, no buffering)
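The prefix scheme combines with the non-blocking "I" prefix seen earlier, which gives eight send routines in total. As a memory aid, the sketch below builds the routine name from the mode and the blocking flag in plain C. The enum send_mode type and the send_function_name() helper are hypothetical names for this illustration, and the snippet only produces strings without calling MPI.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum send_mode { MODE_STANDARD, MODE_SYNCHRONOUS, MODE_BUFFERED, MODE_READY };

/* Write the name of the MPI send routine for the given mode into `out`
 * (which must hold at least 16 bytes) and return `out`. */
static const char *send_function_name(enum send_mode mode, int nonblocking,
                                      char *out, size_t size)
{
    static const char *mode_letter[] = { "", "s", "b", "r" };
    char stem[8];

    assert(out != NULL && size >= 16);
    assert((int)mode >= 0 && (int)mode <= (int)MODE_READY);
    /* e.g. "send", "ssend", "isend", "ibsend" */
    snprintf(stem, sizeof stem, "%s%ssend",
             nonblocking ? "i" : "", mode_letter[mode]);
    /* MPI names capitalize only the first letter after "MPI_" */
    stem[0] = (char)toupper((unsigned char)stem[0]);
    snprintf(out, size, "MPI_%s", stem);
    return out;
}
```

For instance, a blocking synchronous send maps to MPI_Ssend and a non-blocking buffered send maps to MPI_Ibsend.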


Collective or global operations
To simplify communication operations involving multiple processes, one can use collective operations on a communicator
Typical operations:
  Reductions
  Data exchange: Broadcast, Scatter, Gather, All-to-All
  Explicit synchronization


Reductions (1)
A reduction is an arithmetic operation applied to distributed data by a set of processors
Prototype:
  C: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator);
  Fortran: MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, communicator, ierror)
With MPI_Reduce(), only the root processor gets the result
With MPI_Allreduce(), all processes get the result


Reductions (2)
Available operations

  Operation                     MPI_Op
  Minimum                       MPI_MIN
  Maximum                       MPI_MAX
  Sum                           MPI_SUM
  Product element by element    MPI_PROD
  Logical and                   MPI_LAND
  Bitwise and                   MPI_BAND
  Logical or                    MPI_LOR
  Bitwise or                    MPI_BOR
  Logical exclusive or          MPI_LXOR
  Bitwise exclusive or          MPI_BXOR
  Minimum and location          MPI_MINLOC
  Maximum and location          MPI_MAXLOC
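Most of these operations are self-explanatory, but "maximum and location" deserves an example. The sketch below reproduces in plain C what a MAXLOC reduction computes when every process contributes one value paired with its own rank: the largest value and, if several ranks tie, the lowest rank holding it. The struct value_rank type and reduce_maxloc() are hypothetical names, and the array stands in for the per-process contributions. No MPI call is made.

```c
#include <assert.h>
#include <stddef.h>

/* A value together with the rank that contributed it. */
struct value_rank {
    double value;
    int    rank;
};

/* contrib[r] is the value contributed by rank r, for r in [0, np). */
static struct value_rank reduce_maxloc(const double *contrib, int np)
{
    struct value_rank best;

    assert(contrib != NULL && np > 0);
    best.value = contrib[0];
    best.rank = 0;
    for (int r = 1; r < np; r++) {
        /* strict comparison keeps the lowest rank when values tie */
        if (contrib[r] > best.value) {
            best.value = contrib[r];
            best.rank = r;
        }
    }
    return best;
}
```

With contributions {1.5, 7.0, 3.0, 7.0} the result is the value 7.0 located at rank 1.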


Broadcast
A broadcast operation distributes the same data to all processes
One-to-all communication, from a specified process ‘root’ to all processes of a communicator
Prototypes:
  C: int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
  Fortran: MPI_Bcast(buffer, count, datatype, root, communicator, ierror)

[Figure: with root = 1, the buffer of process 1 is copied to processes 0, 1, 2, 3, ..., np-1]


Scatter
One-to-all operation: different data are sent to each receiver process, according to its rank
Prototypes:
  C: int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator);
  Fortran: MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
The ‘send’ parameters are used only by the sender process (the root)

[Figure: with root = 2, sendbuf on process 2 is cut into pieces delivered to recvbuf on processes 0, 1, 2, 3, ..., np-1]


Gather
All-to-one operation: different data are received by one receiver process
Prototypes:
  C: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator);
  Fortran: MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
The ‘receive’ parameters are used only by the receiver process (the root)

[Figure: with root = 3, sendbuf from processes 0, 1, 2, 3, ..., np-1 is collected into recvbuf on process 3]


All-to-All
All-to-all operation: different data are sent to each process, according to its rank
Prototypes (there is no root parameter, since every process both sends and receives):
  C: int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm communicator);
  Fortran: MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, communicator, ierror)

[Figure: each block of sendbuf on processes 0, 1, 2, 3, ..., np-1 goes to recvbuf on a different process 0, 1, 2, 3, ..., np-1]
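The data movement of an all-to-all exchange amounts to a transpose of blocks: block j of process i ends up as block i of process j. The following plain C sketch simulates this on ordinary arrays, with one row per process. The simulate_alltoall() helper and the row-per-process layout are assumptions made for this illustration. A real program would call MPI_Alltoall on each process instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* `send` and `recv` each hold np rows of np*count ints.
 * Row p is the buffer of process p, and block q of a row is the
 * group of `count` ints starting at index q*count in that row. */
static void simulate_alltoall(const int *send, int *recv, int np, int count)
{
    assert(send != NULL && recv != NULL && np > 0 && count > 0);
    for (int src = 0; src < np; src++) {
        for (int dst = 0; dst < np; dst++) {
            /* block dst of the sender becomes block src of the receiver */
            memcpy(&recv[(dst * np + src) * count],
                   &send[(src * np + dst) * count],
                   (size_t)count * sizeof(int));
        }
    }
}
```

With np = 3 and count = 1, the send rows {0, 1, 2}, {10, 11, 12}, {20, 21, 22} give the receive rows {0, 10, 20}, {1, 11, 21}, {2, 12, 22}.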


Explicit synchronization
Synchronization barrier: all processes of a communicator wait for the last process to enter the barrier before continuing their execution
On computers with a hardware barrier (such as SGI and Cray T3E), the MPI barrier is slower than the hardware barrier
Prototype:
  C: int MPI_Barrier(MPI_Comm communicator);
  Fortran: MPI_Barrier(communicator, ierror)


Matrix trace (1)
Computing the trace of a square matrix A of order n
The trace of a (square) matrix is the sum of its diagonal elements:

$$\mathrm{Trace}(A) = \sum_{k=1}^{n} A_{k,k}$$

The sum can easily be split across multiple processors, ending with a reduction to compute the complete trace


Matrix trace (2.1)

    #include <stdio.h>
    #include <mpi.h>

    #define N 8   /* example size (assumption); suppose N = m*np */

    int main(int argc, char **argv) {
        int me, np, root = 0;
        int i, tranche;              /* loop index, slice size per process */
        double A[N][N];
        double buffer[N], diag[N];
        double traceA, trace_loc;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        tranche = N / np;

        /* Initialization of A, done by process 0 */
        /* ... */

        /* Buffering the diagonal elements on the root process */
        if (me == root) {
            for (i = 0; i < N; i++)
                buffer[i] = A[i][i];
        }

        /* The scatter operation distributes the buffered elements among the processes */
        MPI_Scatter(buffer, tranche, MPI_DOUBLE,
                    diag, tranche, MPI_DOUBLE, root, MPI_COMM_WORLD);


Matrix trace (2.2)

        /* Each process computes its partial trace */
        trace_loc = 0;
        for (i = 0; i < tranche; i++)
            trace_loc += diag[i];

        /* Then we do the reduction */
        MPI_Reduce(&trace_loc, &traceA, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

        if (me == root)
            printf("The trace of A is: %f\n", traceA);

        MPI_Finalize();
        return 0;
    }
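The decomposition used by the trace program can be checked without MPI. The plain C sketch below computes the partial trace that each rank would own and then adds the partial results, which stands in for the scatter followed by the sum reduction. The helper names partial_trace() and trace_by_slices() and the row-major storage of the matrix are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Partial trace owned by `rank`: the sum of `tranche` consecutive
 * diagonal elements of the n x n row-major matrix `a`. */
static double partial_trace(const double *a, int n, int tranche, int rank)
{
    double sum = 0.0;
    int first = rank * tranche;

    assert(a != NULL && first >= 0 && first + tranche <= n);
    for (int k = first; k < first + tranche; k++)
        sum += a[(size_t)k * (size_t)n + (size_t)k];
    return sum;
}

/* Sum of the partial traces of np ranks; n must be a multiple of np. */
static double trace_by_slices(const double *a, int n, int np)
{
    double total = 0.0;

    assert(np > 0 && n % np == 0);
    for (int rank = 0; rank < np; rank++)
        total += partial_trace(a, n, n / np, rank);   /* plays the role of MPI_SUM */
    return total;
}
```

For a 4 x 4 matrix with diagonal 1, 2, 3, 4 and np = 2, rank 0 contributes 3, rank 1 contributes 7, and the reduced trace is 10.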


One-sided communications (1/2)
No synchronization during communications
Allows a simulated shared-memory implementation (Remote Memory Access)
Defining the part of memory other processes can access:
  MPI_Win_create()
  MPI_Win_free()
One-sided communication functions:
  MPI_Put()
  MPI_Get()
  MPI_Accumulate()
    Operations: MPI_SUM, MPI_LAND, MPI_REPLACE
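The effect of an accumulate operation on the exposed memory can be pictured with a local simulation. In the plain C sketch below an ordinary array plays the role of the window of the target process. The enum acc_op type and window_accumulate() are hypothetical names, only two of the operations are modelled, and nothing here performs a remote access.

```c
#include <assert.h>
#include <stddef.h>

enum acc_op { ACC_SUM, ACC_REPLACE };   /* stand-ins for MPI_SUM and MPI_REPLACE */

/* Combine `count` ints from `data` into the window, starting at `offset`. */
static void window_accumulate(int *window, int window_len, int offset,
                              const int *data, int count, enum acc_op op)
{
    assert(window != NULL && data != NULL);
    assert(offset >= 0 && count >= 0 && offset + count <= window_len);
    for (int i = 0; i < count; i++) {
        if (op == ACC_SUM)
            window[offset + i] += data[i];   /* combine with the target contents */
        else
            window[offset + i] = data[i];    /* overwrite, the same effect as a put */
    }
}
```

Starting from the window {1, 2, 3, 4}, accumulating {10, 20} with the sum operation at offset 1 gives {1, 12, 23, 4}. Accumulating with the replace operation at the same place would instead give {1, 10, 20, 4}.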


One-sided communications (2/2)
Active synchronization function: MPI_Win_fence()
  Takes a memory window win as parameter
  Collective operation (barrier) on all processes of the group of win (the group returned by MPI_Win_get_group())
  Acts as a synchronization barrier that completes every RMA transfer using the window win
Passive synchronization functions: MPI_Win_lock() and MPI_Win_unlock()
  Classical mutex functions
  The initiator of the communication is solely responsible for the synchronization
  When MPI_Win_unlock() returns, every transfer operation is finished


Parallel Input/Output
Intelligent management of I/O is mandatory for parallel applications
MPI-IO is a set of functions for optimised I/O
It extends the classical file access functions:
  Collective synchronization for accessing files
  Shared or individual file offsets
  Blocking or non-blocking reads
  Views (for accessing non-sequential memory zones)
  Syntax similar to the MPI communication functions


Dynamic allocation of processes
Dynamic change of the number of processes
  New processes can be spawned during execution
  The MPI_Comm_spawn() function creates a new set of processes on other processors
  An inter-communicator links the parent's domain to the new domain gathering the new processes
  The MPI_Intercomm_merge() function builds a single communicator from an inter-communicator
  MPI-2 allows a dynamic MPMD style with the function MPI_Comm_spawn_multiple()
  MPI_Comm_get_attr() with the MPI_UNIVERSE_SIZE key gives the maximum possible number of MPI processes
Process destruction
  There is no explicit exit() function for an MPI process
  For an MPI process to exit, its communicator MPI_COMM_WORLD must contain only finalizing processes
  All inter-communicators must be closed before finalization


Remarks and conclusion
Thanks to the distributed computing community, MPI has become a standard library for message passing
MPI-2 breaks the classic SPMD message passing model of MPI-1
Many implementations exist, for most architectures
A lot of documentation and many publications are available


Some pointers
MPI standard official site: http://www-unix.mcs.anl.gov/mpi/
The MPI Forum: http://www.mpi-forum.org/
Book: MPI, The Complete Reference (Marc Snir et al.): http://www.netlib.org/utk/papers/mpi-book/mpi-book.html


Message Passing Interface

Performance


Contents
MPI implementation
Performance metrics
High performance networks
Communication type / 0-copy


MPI implementation
LAM-MPI
  Optimised for collective operations
MPICH
  New low-level drivers are easy to write
Open-MPI
  Tries to combine the performance and the ease of use of the two previous ones
  Conforms to MPI-2
IBM / NEC / FUJITSU...
  Complete, high-performance implementations of MPI-2
  Target specific architectures


Performance metrics
Comparison criteria
  Latency, bandwidth
  Collective operations
  Overlapping capabilities
  Real applications
Measuring tools
  Round Trip Time (ping-pong)
  NetPipe
  NAS benchmarks: CG, LU, BT, FT
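A ping-pong test measures the time for a message to travel to a peer and back. The usual way to turn that round trip time into the two headline metrics is sketched below in plain C. The function names are hypothetical, and the halving assumes that both directions of the link behave the same way.

```c
#include <assert.h>

/* One-way latency estimated from a measured round trip time (seconds). */
static double one_way_latency(double round_trip_seconds)
{
    assert(round_trip_seconds > 0.0);
    return round_trip_seconds / 2.0;
}

/* Bandwidth in bytes per second for a message of `message_bytes`
 * that took `round_trip_seconds` to go to the peer and come back. */
static double bandwidth_bytes_per_second(double message_bytes,
                                         double round_trip_seconds)
{
    assert(message_bytes >= 0.0);
    return message_bytes / one_way_latency(round_trip_seconds);
}
```

In practice the latency figure is usually taken from very small messages and the bandwidth figure from large ones, each averaged over many round trips.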


High performance networks (1/3)
Technologies
  Myrinet
    Connectionless reliable API
    Registered buffers
    Fully programmable DMA NIC processor
    Up to 2 Gb/s full-duplex bandwidth with Myrinet 2000
  SCINet
    Torus topology based network with static routing
    No need to register buffers
    Very small latency (suitable for RMA)
    Up to 2 Gb/s
  Gigabit Ethernet
    No need to register buffers
    DMA operations
    High latency
    Up to 1 Gb/s and 10 Gb/s bandwidth
  Infiniband
    Reliable Connection mode and Unreliable Datagram mode
    Registered buffers
    Queued DMA operations
    Up to 10 Gb/s bandwidth



Myrinet
  Socket-GM
  MPICH-GM
SCINet
  No functional socket API
  SCI-MPICH
Gigabit Ethernet
  Have to use the socket interface
Infiniband
  IoIP
  LAM-MPI, MPICH, MPI/Pro, etc.


Eager vs Rendez-vous (1/2)
Eager protocol
  The message is sent without control (no handshake with the receiver)
    Better latency
  It is copied into a buffer if the receiver has not posted the reception yet
    Memory consuming for long messages
  Used only for short messages (< 64 KB)
Rendez-vous protocol
  Sender and receiver are synchronized
    Higher latency
  0-copy
    Better bandwidth
    Reduced memory consumption
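The choice between the two protocols is typically a simple size test inside the MPI library. The plain C sketch below models that decision. The 64 KB limit is taken from the slide, whereas the enum protocol type, the EAGER_LIMIT_BYTES constant and choose_protocol() are hypothetical names. Real implementations use their own, often configurable, thresholds.

```c
#include <assert.h>
#include <stddef.h>

enum protocol { PROTOCOL_EAGER, PROTOCOL_RENDEZVOUS };

/* Threshold assumed from the slide: messages below 64 KB go eager. */
#define EAGER_LIMIT_BYTES ((size_t)64 * 1024)

/* Pick the protocol for a message of the given size. */
static enum protocol choose_protocol(size_t message_bytes)
{
    /* short message: send at once, the receiver buffers it if needed;
     * long message: synchronize first so the data can be moved 0-copy */
    return message_bytes < EAGER_LIMIT_BYTES ? PROTOCOL_EAGER
                                             : PROTOCOL_RENDEZVOUS;
}
```

A 1 KB message would be sent eagerly, while a 1 MB message would wait for the rendez-vous with the receiver.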


Eager vs Rendez-vous (2/2)


Communication types


High performance networks and 0-copy

Myrinet latency: 8 µs
MPICH-GM latency: 33 µs
MPICH-Vdummy latency: 94 µs


Conclusion
Many MPI implementations, with similar performance
Multiple measurement criteria and multiple tools
  Latency, bandwidth
  Benchmarks and microbenchmarks
  Real applications
High performance networks force attention to small performance details
  Network bandwidth equals the memory bandwidth
  Latency is smaller than some OS operations
  Performance relies on good programming
Performance results can vary a lot according to the type of communication employed
  Asynchronism is mandatory
  Bad programming results in bad performance
  0-copy can be mandatory
