High Performance Computing
Course Notes 2007-2008

Message Passing Programming I

Computer Science, University of Warwick

Message Passing Programming

Message Passing is the most widely used parallel programming model

Message passing works by creating a number of tasks, uniquely named, that interact by sending and receiving messages to and from one another (hence the message passing)

Generally, processes communicate through sending the data from the address space of one process to that of another

Communication of processes (via files, pipe, socket)

Communication of threads within a process (via global data area)

Programs based on message passing can be based on standard sequential language programs (C/C++, Fortran), augmented with calls to library functions for sending and receiving messages


Message Passing Interface (MPI)

MPI is a specification, not a particular implementation

Does not specify process startup, error codes, amount of system buffer, etc

MPI is a library, not a language

The goals of MPI: functionality, portability and efficiency

Message passing model > MPI specification > MPI implementation


OpenMP vs MPI

In a nutshell

MPI is used on distributed-memory systems

OpenMP is used for code parallelisation on shared-memory systems

Both are explicit parallelism

High-level control (OpenMP), lower-level control (MPI)


A little history

Message-passing libraries developed for a number of early distributed memory computers

By 1993 there were loads of vendor specific implementations

By 1994 MPI-1 came into being

By 1997 MPI-2 was finalized


The MPI programming model

MPI standards -

MPI-1 (1.1, 1.2), MPI-2 (2.0)

Forwards compatibility preserved between versions

Standard bindings - for C, C++ and Fortran. Have seen MPI bindings for Python, Java etc (all non-standard)

We will stick to the C binding, for the lectures and coursework. More info on MPI www.mpi-forum.org

Implementations - For your laptop pick up MPICH, a free portable implementation of MPI (http://www-unix.mcs.anl.gov/mpi/mpich/index.htm)

Coursework will use MPICH


MPI

MPI is a complex system comprising 129 functions with numerous parameters and variants

Six of them are indispensable, and those six alone are already enough to write a large number of useful programs

Other functions add flexibility (datatype), robustness (non-blocking send/receive), efficiency (ready-mode communication), modularity (communicators, groups) or convenience (collective operations, topology).

In the lectures, we are going to cover the most commonly encountered functions


The MPI programming model

Computation comprises one or more processes that communicate by calling library routines to send messages to, and receive messages from, other processes

(Generally) a fixed set of processes created at outset, one process per processor

Different from PVM


Intuitive interfaces for sending and receiving messages

Send(data, destination), Receive(data, source)

minimal interface

Not enough in some situations, we also need

Message matching – add message_id at both send and receive interfaces

they become Send(data, destination, msg_id), Receive(data, source, msg_id)

Message_id

• Is expressed using an integer, termed the message tag

• Allows the programmer to deal with the arrival of messages in an orderly fashion (queue and then deal with


How to express the data in the send/receive interfaces

Early stages: (address, length) for the send interface

(address, max_length) for the receive interface

They are not always good enough:

The data to be sent may not be in contiguous memory locations

The storage format of the data may not be the same, or known in advance, on a heterogeneous platform

Eventually, a triple (address, count, datatype) is used to express the data to be sent and (address, max_count, datatype) for the data to be received

This reflects the fact that a message contains much more structure than just a string of bits. For example, (vector_A, 300, MPI_REAL)

Programmers can construct their own datatype

Now, the interfaces become send(address, count, datatype, destination, msg_id) and receive(address, max_count, datatype, source, msg_id)


How to distinguish messages

Message tag is necessary, but not sufficient

So, communicator is introduced …


Communicators

Messages are put into contexts

Contexts are allocated at run time by the system in response to programmer requests

The system can guarantee that each generated context is unique

The processes belong to groups

The notions of context and group are combined in a single object, which is called a communicator

A communicator identifies a group of processes and a communication context

The MPI library defines an initial communicator, MPI_COMM_WORLD, which contains all the processes running in the system

The messages from different process groups can have the same tag

So the send interface becomes send(address, count, datatype, destination, tag, comm)


Status of the received messages

The structure of the message status is added to the receive interface

Status holds the information about source, tag and actual message size

In the C language, source can be retrieved by accessing status.MPI_SOURCE,

tag can be retrieved by status.MPI_TAG and

actual message size can be retrieved by calling the function MPI_Get_count(&status, datatype, &count)

The receive interface becomes receive(address, maxcount, datatype, source, tag, communicator, status)


How to express source and destination

The processes in a communicator (group) are identified by ranks

If a communicator contains n processes, process ranks are integers from 0 to n-1

Source and destination processes in the send/receive interface are the ranks


Some other issues

In the receive interface, tag can be a wildcard, which means a message with any tag will be received

In the receive interface, source can also be a wildcard, which matches any source


MPI basics

First six functions (C bindings)

MPI_Send (buf, count, datatype, dest, tag, comm)

Send a message

buf address of send buffer

count no. of elements to send (>=0)

datatype of elements

dest process id of destination

tag message tag

comm communicator (handle)

Calculating the size of the data to be sent …

buf address of send buffer

count * sizeof (datatype) bytes of data




MPI_Recv (buf, count, datatype, source, tag, comm, status)

Receive a message

buf address of receive buffer (var param)

count max no. of elements in receive buffer (>=0)

datatype of receive buffer elements

source process id of source process, or MPI_ANY_SOURCE

tag message tag, or MPI_ANY_TAG

comm communicator

status status object




MPI_Init (int *argc, char ***argv)

Initiate a computation

argc (number of arguments) and argv (argument vector) are main program’s arguments

Must be called first, and once per process

MPI_Finalize ( )

Shut down a computation

The last thing that happens




MPI_Comm_size (MPI_Comm comm, int *size)

Determine number of processes in comm

comm is communicator handle, MPI_COMM_WORLD is the default (including all MPI processes)

size holds number of processes in group

MPI_Comm_rank (MPI_Comm comm, int *pid)

Determine id of current (or calling) process

pid holds id of current process


MPI basics – a basic example

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* id of this process */
    printf("Hello, world. I am %d of %d\n", rank, nprocs);
    MPI_Finalize();
    return 0;
}

mpirun -np 4 myprog

Hello, world. I am 1 of 4

(each of the four processes prints one such line, in no guaranteed order)





MPI basics – send and recv example

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    int buffer[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size < 2) {
        printf("Please run with two processes.\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        /* Sender: fill the buffer and send 10 ints to rank 1 with tag 123 */
        for (i = 0; i < 10; i++)
            buffer[i] = i;
        MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
    }

    if (rank == 1) {
        /* Receiver: pre-fill with -1 so that missing data would be detected */
        for (i = 0; i < 10; i++)
            buffer[i] = -1;
        MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
        for (i = 0; i < 10; i++) {
            if (buffer[i] != i)
                printf("Error: buffer[%d] = %d but is expected to be %d\n", i, buffer[i], i);
        }
    }

    MPI_Finalize();
    return 0;
}


MPI language bindings

Standard (accepted) bindings for Fortran, C and C++

Java bindings are work in progress

JavaMPI: Java wrapper to native calls

mpiJava: JNI wrappers

jmpi: pure Java implementation of MPI library

MPIJ: same idea

Java Grande Forum trying to sort it all out

We will use the C bindings


High Performance Computing
Course Notes 2007-2008

Message Passing Programming II


Modularity

MPI supports modular programming via communicators

Provides information hiding by encapsulating local communications and having local namespaces for processes

All MPI communication operations specify a communicator (process group that is engaged in the communication)


Forming new communicators – one approach

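MPI_Comm world, workers;
MPI_Group world_group, worker_group;
int ranks[1];
int numprocs, myid, server;

MPI_Init(&argc, &argv);
world = MPI_COMM_WORLD;
MPI_Comm_size(world, &numprocs);
MPI_Comm_rank(world, &myid);
server = numprocs - 1;                  /* the last process acts as server */

MPI_Comm_group(world, &world_group);    /* group behind MPI_COMM_WORLD */
ranks[0] = server;
MPI_Group_excl(world_group, 1, ranks, &worker_group);   /* everyone except the server */
MPI_Comm_create(world, worker_group, &workers);         /* yields MPI_COMM_NULL on the server */

/* the worker processes would communicate over workers here */

MPI_Group_free(&world_group);
MPI_Group_free(&worker_group);
if (workers != MPI_COMM_NULL)           /* the server has no communicator to free */
    MPI_Comm_free(&workers);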


Forming new communicators - functions

int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)

int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)

int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)

int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)

int MPI_Group_free(MPI_Group *group)

int MPI_Comm_free(MPI_Comm *comm)


Forming new communicators – another approach (1)

MPI_Comm_split (comm, colour, key, newcomm)

Creates one or more new communicators from the original comm

comm communicator (handle)

colour control of subset assignment (processes with same colour are in same new communicator)

key control of rank assignment

newcomm new communicator

Is a collective communication operation (must be executed by all processes in the process group comm)

Is used to (re-) allocate processes to communicator (groups)




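MPI_Comm comm, newcomm;
int myid, colour;

MPI_Comm_rank(comm, &myid);   // id of current process
colour = myid % 3;
MPI_Comm_split(comm, colour, myid, &newcomm);

With 8 processes in comm (ranks 0 to 7) this gives three new communicators:

colour 0: old ranks 0, 3, 6 become new ranks 0, 1, 2

colour 1: old ranks 1, 4, 7 become new ranks 0, 1, 2

colour 2: old ranks 2, 5 become new ranks 0, 1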




New communicator created for each new value of colour

Each new communicator (sub-group) comprises those processes that specify its value in colour

These processes are assigned new identifiers (ranks, starting at zero) with the order determined by the value of key (or by their ranks in the old communicator in event of ties)


Communications

Point-to-point communications: involving exactly two processes, one sender and one receiver

For example, MPI_Send() and MPI_Recv()

Collective communications: involving a group of processes


Collective operations

i.e. coordinated communication operations involving multiple processes

Programmer could do this by hand (tedious), so MPI provides specialized collective communication operations

barrier – synchronize all processes

broadcast – sends data from one to all processes

gather – gathers data from all processes to one process

scatter – scatters data from one process to all processes

reduction operations – sums, multiplies etc. distributed data

all executed collectively (on all processes in the group, at the same time, with the same parameters)


MPI_Barrier (comm)

Global synchronization

comm is the communicator handle

No process returns from the function until all processes have called it

Good way of separating one phase from another



Barrier synchronizations

You are only as quick as your slowest process


MPI_Bcast (buf, count, type, root, comm)

Broadcast data from root to all processes

buf address of buffer (input on root, output on all other processes)

count no. of entries in buffer (>=0)

type datatype of buffer elements

root process id of root process

comm communicator

[Figure: one-to-all broadcast (MPI_BCAST). Before the call only the root holds data item A0; afterwards every process holds A0.]


Example of MPI_Bcast

Broadcast 100 ints from process 0 to every process in the group

MPI_Comm comm;
int array[100];
int root = 0;
…
MPI_Bcast(array, 100, MPI_INT, root, comm);


MPI_Gather (inbuf, incount, intype, outbuf, outcount, outtype, root, comm)

Collective data movement function

Input to gather:

inbuf address of input buffer

incount no. of elements sent from each (>=0)

intype datatype of input buffer elements

Output of gather:

outbuf address of output buffer (var param)

outcount no. of elements received from each

outtype datatype of output buffer elements

Receiving process:

root process id of root process

comm communicator

[Figure: all-to-one gather (MPI_GATHER). The processes hold A0, A1, A2 and A3 respectively; afterwards the root holds A0 A1 A2 A3.]


MPI_Gather example

Gather 100 ints from every process in group to root

MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf = NULL;
...
MPI_Comm_rank(comm, &myrank);   // find proc. id
if (myrank == root) {
    MPI_Comm_size(comm, &gsize);   // find group size
    rbuf = (int *) malloc(gsize * 100 * sizeof(int));   // allocate receive buffer
}
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);


MPI_Scatter (inbuf, incount, intype, outbuf, outcount, outtype, root, comm)


inbuf address of input buffer

incount no. of elements sent to each (>=0)

intype datatype of input buffer elements

outbuf address of output buffer

outcount no. of elements received by each

outtype datatype of output buffer elements

root process id of root process

comm communicator

[Figure: one-to-all scatter (MPI_SCATTER). The root holds A0 A1 A2 A3; afterwards the processes hold A0, A1, A2 and A3 respectively.]


Example of MPI_Scatter

MPI_Scatter is the reverse of MPI_Gather

It is as if the root sends using

MPI_Send(inbuf+i*incount * sizeof(intype), incount, intype, i, …)

MPI_Comm comm;
int gsize, *sendbuf;
int root, rbuf[100];
…
MPI_Comm_size(comm, &gsize);
sendbuf = (int *) malloc(gsize * 100 * sizeof(int));
…
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);


MPI_Reduce (inbuf, outbuf, count, type, op, root, comm)

Collective reduction function

inbuf address of input buffer

outbuf address of output buffer

count no. of elements in input buffer (>=0)

type datatype of input buffer elements

op operation

root process id of root process

comm communicator

[Figure: MPI_REDUCE on four processes holding the pairs (2, 4), (5, 7), (0, 3) and (6, 2). Using MPI_MIN with root = 0, process 0 receives (0, 2). Using MPI_SUM with root = 1, process 1 receives (13, 16).]


MPI_Allreduce (inbuf, outbuf, count, type, op, comm)


inbuf address of input buffer

outbuf address of output buffer (var param)

count no. of elements in input buffer (>=0)

type datatype of input buffer elements

op operation

comm communicator

[Figure: MPI_ALLREDUCE using MPI_MIN on four processes holding the pairs (2, 4), (5, 7), (0, 3) and (6, 2). Every process receives (0, 2).]


Buffering in MPI communications

Application buffer: specified by the first parameter in MPI_Send/Recv functions

System buffer:

Hidden from the programmer and managed by the MPI library

Is limited and can easily be exhausted


Blocking and non-blocking communications

Blocking send

The sender doesn’t return until the application buffer can be re-used (which often means that the data have been copied from application buffer to system buffer), but this doesn’t mean that the data will be received

Blocking receive

The receiver doesn’t return until the data are ready for use by the receiver (which often means that the data have been copied from system buffer to application buffer)

Non-blocking send/receive

The calling process returns immediately

The call only requests that the MPI library perform the operation; the user cannot predict when that will happen

Unsafe to modify the application buffer until you can make sure the requested operation has been performed (MPI provides routines to test this)

Can be used to overlap computation with communication and have possible performance gains

MPI_Isend (buf, count, datatype, dest, tag, comm, request)


Testing non-blocking communications for completion

Completion tests come in two types:

WAIT type

TEST type

WAIT type: the WAIT type testing routines block until the communication has completed.

A non-blocking communication immediately followed by a WAIT-type test is equivalent to the corresponding blocking communication

TEST type: these routines return TRUE or FALSE value

The process can perform some other tasks when the communication has not completed


Testing non-blocking communications for completion

The WAIT-type test is:

MPI_Wait (request, status)

This routine blocks until the communication specified by the handle request has completed. The request handle will have been returned by an earlier call to a non-blocking communication routine.

The TEST-type test is:

MPI_Test (request, flag, status)

In this case the communication specified by the handle request is simply queried to see if the communication has completed and the result of the query (TRUE or FALSE) is returned immediately in flag.


Testing multiple non-blocking communications for completion

Wait for all communications to complete

MPI_Waitall (count, array_of_requests, array_of_statuses)

This routine blocks until all the communications specified by the request handles, array_of_requests, have completed. The statuses of the communications are returned in the array array_of_statuses and each can be queried in the usual way for the source and tag if required

Test if all communications have completed

MPI_Testall (count, array_of_requests, flag, array_of_statuses)

If all the communications have completed, flag is set to TRUE, and information about each of the communications is returned in array_of_statuses. Otherwise flag is set to FALSE and array_of_statuses is undefined.


Testing multiple non-blocking communications for completion

Query a number of communications at a time to find out if any of them have completed

Wait: MPI_Waitany (count, array_of_requests, index, status)

MPI_WAITANY blocks until one or more of the communications associated with the array of request handles, array_of_requests, has completed.

The index of the completed communication in the array_of_requests handles is returned in index, and its status is returned in status.

Should more than one communication have completed, the choice of which is returned is arbitrary.

Test: MPI_Testany (count, array_of_requests, index, flag, status)

The result of the test (TRUE or FALSE) is returned immediately in flag.
