CSE 160 – Lecture 16
MPI Concepts, Topology and Synchronization
So Far …
• We have concentrated on generic message passing algorithms
• We have used PVM to implement codes (simpler on workstations)
• MPI is a de facto standard for message passing and introduces some important concepts
Message Addressing
• Identify an endpoint
• Use a tag to distinguish a particular message
  – pvm_send(dest, tag)
  – MPI_SEND(buf, len, type, dest, tag, COMM)
• Receiving
  – recv(src, tag); recv(*, tag); recv(src, *); recv(*, *)
• What if you want to build a library that uses message passing? Is (src, tag) safe in all instances?
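As a sketch of the four receive patterns above in MPI's C binding (buf, len, src, tag, and comm here are placeholder variables, not names from the slide):

    MPI_Status st;
    MPI_Recv(buf, len, MPI_DOUBLE, src,            tag,         comm, &st);  /* recv(src, tag) */
    MPI_Recv(buf, len, MPI_DOUBLE, MPI_ANY_SOURCE, tag,         comm, &st);  /* recv(*, tag)   */
    MPI_Recv(buf, len, MPI_DOUBLE, src,            MPI_ANY_TAG, comm, &st);  /* recv(src, *)   */
    MPI_Recv(buf, len, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &st);  /* recv(*, *)     */
    /* st.MPI_SOURCE and st.MPI_TAG report who actually sent the matched message */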
Level 0 Issues
• Basic Pt-2-Pt message passing is straightforward, but how does one …
  – Make it go fast
    • Eliminate extra memory copies
    • Take advantage of specialized hardware
  – Move complex data structures (packing)
  – Receive from one-of-many (wildcarding)
  – Synchronize a group of tasks
  – Recover from errors
  – Start tasks
  – Build safe libraries
  – Monitor tasks
  – …
MPI-1 addresses many of the level 0 issues
(but not all)
A long history of research efforts in message passing
• P4
• Chameleon
• Parmacs
• TCGMSG
• CHIMP
• NX (Intel i860, Paragon)
• PVM
• …
And these begot MPI
So What is MPI
• It is a standard message passing API
  – Specifies many variants of send/recv
    • 9 send interface calls, e.g., synchronous send, asynchronous send, ready send, asynchronous ready send
  – Plus other defined APIs
    • Process topologies
    • Group operations
    • Derived datatypes
    • Profiling API (standard way to instrument MPI code)
  – Implemented and optimized by machine vendors
  – Should you use it? Absolutely!
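For concreteness, a sampling of the send variants by their standard C names (not the complete set of nine; buf, count, type, dest, tag, comm, request, and status are placeholders):

    MPI_Send (buf, count, type, dest, tag, comm);            /* standard send                       */
    MPI_Ssend(buf, count, type, dest, tag, comm);            /* synchronous send                    */
    MPI_Bsend(buf, count, type, dest, tag, comm);            /* buffered send (needs MPI_Buffer_attach) */
    MPI_Rsend(buf, count, type, dest, tag, comm);            /* ready send                          */
    MPI_Isend(buf, count, type, dest, tag, comm, &request);  /* nonblocking (immediate) send        */
    MPI_Wait(&request, &status);                             /* complete a nonblocking operation    */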
So What’s Missing in MPI-1?
• Process control
  – How do you start 100 tasks?
  – How do you kill/signal/monitor remote tasks?
• I/O– Addressed in MPI-2
• Fault-tolerance– One MPI process dies, the rest eventually hang
• Interoperability
  – No standard for running a single parallel job across architectures (e.g., a computation cannot be split between x86 Linux and Alpha)
Communication Envelope
• (src, tag) is enough to distinguish messages; however, there is a problem:
  – How does one write safe libraries that also use message passing?
    • Tags used in libraries can be the same as tags used in the calling code, so messages could get confused
• Message envelope – add another, system-assigned tag to a message
  – This label defines a message envelope
Envelopes
[Figure: two envelopes, each holding messages labeled (Tag 100, Src A), (Tag 100, Src B), (Tag 200, Src A)]
• Context tag – messages are received relative to a context
Communicators
• Messages ride inside of message envelopes and receives can only match candidate messages relative to an envelope.
• MPI implements these envelopes as communicators.
• A communicator is more than just an envelope – it also encompasses a group of processes
• Communicator = context + group of tasks
MPI_COMM_WORLD
• MPI programs are static – they can never grow or shrink
• There is one default communicator called MPI_COMM_WORLD
• All processes are members of COMM_WORLD
• Each process is labeled 0 – N for a communicator of size N+1
  – Similar to a PVM group
How do you build new groups?
• All new communicators are derived from existing communicators
  – All communicators have MPI_COMM_WORLD as an "ancestor"
  – New communicators are formed synchronously
    • All members of a new communicator call the construction routine at the same time
• MPI_COMM_DUP(), MPI_COMM_SPLIT()
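A minimal sketch of MPI_COMM_SPLIT, assuming the goal is to divide MPI_COMM_WORLD into even-rank and odd-rank sub-communicators:

    /* Collective call: every process in the parent communicator must participate. */
    int world_rank;
    MPI_Comm subcomm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_split(MPI_COMM_WORLD,
                   world_rank % 2,   /* color: processes with the same color share a new communicator */
                   world_rank,       /* key: orders ranks within the new communicator                 */
                   &subcomm);
    /* Each process now has a small local rank (starting at 0) inside its own subcomm. */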
Communicators as groups
[Figure: six processes ranked 0–5 in a parent communicator, split into two new communicators of three processes each; every process receives a new local rank 0–2 within its sub-communicator]
Running MPI Programs
• The MPI-1 standard does not specify how to run an MPI program, just as the Fortran standard does not specify how to run a Fortran program.
• In general, starting an MPI program depends on the implementation of MPI you are using, and might require various scripts, program arguments, and/or environment variables.
• mpiexec <args> is part of MPI-2, as a recommendation, but not a requirement
  – You can use mpiexec for MPICH and mpirun for other systems
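For illustration only (launchers and flags vary by installation; my_mpi_program is a placeholder name):

    mpiexec -n 4 ./my_mpi_program     # MPI-2 recommended launcher (e.g., MPICH)
    mpirun -np 4 ./my_mpi_program     # common alternative on other systems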
Finding Out About the Environment
• Two important questions that arise early in a parallel program are:
  – How many processes are participating in this computation?
  – Which one am I?
• MPI provides functions to answer these questions:
  – MPI_Comm_size reports the number of processes
  – MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
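A minimal, self-contained sketch using both calls (the output format is only illustrative):

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int size, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are participating? */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? (0 .. size-1)          */
        printf("I am process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }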
MPI Datatypes
• The data in a message to be sent or received is described by a triple (address, count, datatype), where
• An MPI datatype is recursively defined as:
  – predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
  – a contiguous array of MPI datatypes
  – a strided block of datatypes
  – an indexed array of blocks of datatypes
  – an arbitrary structure of datatypes
• There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise.
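As one hedged example, a derived datatype for a row of an R x C matrix of doubles stored column-major could be built with MPI_Type_vector (R, C, a, dest, and tag are placeholder names):

    /* Consecutive elements of one row are R doubles apart in column-major storage. */
    MPI_Datatype rowtype;
    MPI_Type_vector(C,            /* count: number of blocks (one per column)  */
                    1,            /* blocklength: one element per block        */
                    R,            /* stride between blocks, in elements        */
                    MPI_DOUBLE, &rowtype);
    MPI_Type_commit(&rowtype);
    /* Send row i by pointing at its first element a[i]: */
    MPI_Send(&a[i], 1, rowtype, dest, tag, MPI_COMM_WORLD);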
MPI Tags
• Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
• Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.
• Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes.
Why Datatypes?
• Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).
• Specifying application-oriented layout of data in memory
  – reduces memory-to-memory copies in the implementation
  – allows the use of special hardware (scatter/gather) when available
• Theoretically simplifies the programmer’s life
• How would you build "datatypes" in PVM? – with sequences of pvm_pack() calls (see the sketch below)
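For comparison, a sketch of the PVM approach (assumes pvm3.h is included; n, vec, dest, and tag are placeholder variables):

    pvm_initsend(PvmDataDefault);    /* start a new send buffer                */
    pvm_pkint(&n, 1, 1);             /* pack one int                           */
    pvm_pkdouble(vec, n, 1);         /* pack n doubles, stride 1               */
    pvm_send(dest, tag);             /* ship the packed buffer to task `dest`  */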
Introduction to Collective Operations in MPI
• Collective operations are called by all processes in a communicator.
• MPI_BCAST distributes data from one process (the root) to all others in a communicator.
  – How does this differ from pvm_mcast() and pvm_bcast()?
• MPI_REDUCE combines data from all processes in communicator and returns it to one process.
• In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.
Example: PI in C - 1

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>
    int main(int argc, char *argv[])
    {
        int done = 0, n, myid, numprocs, i;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        while (!done) {
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;

Example: PI in C - 2

            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
        MPI_Finalize();
        return 0;
    }
Sources of Deadlocks
• Send a large message from process 0 to process 1
  – If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
• What happens with:
    Process 0: Send(1); Recv(1)
    Process 1: Send(0); Recv(0)
• This is called "unsafe" because it depends on the availability of system buffers (PVM does buffering)
Some Solutions to the "unsafe" Problem
• Order the operations more carefully:
    Process 0: Send(1); Recv(1)
    Process 1: Recv(0); Send(0)
• Use non-blocking operations:
    Process 0: Isend(1); Irecv(1); Waitall
    Process 1: Isend(0); Irecv(0); Waitall
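A sketch of the non-blocking exchange in MPI's C binding (sendbuf, recvbuf, count, tag, and other are placeholders; `other` is the partner's rank, 1 on process 0 and 0 on process 1):

    /* Post both operations, then wait for all of them; neither post blocks. */
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Isend(sendbuf, count, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);   /* completion happens here, regardless of buffering */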
Extending the Message-Passing Interface
• Dynamic process management
  – Dynamic process startup
  – Dynamic establishment of connections
• One-sided communication
  – Put/get
  – Other operations
• Parallel I/O
• Other MPI-2 features
  – Generalized requests
  – Bindings for C++ / Fortran-90; interlanguage issues
When to use MPI
• Portability and Performance
• Irregular Data Structures
• Building Tools for Others
  – Libraries
• Need to manage memory on a per-processor basis
When not to use MPI
• Regular computation matches HPF
  – But see the PETSc/HPF comparison (ICASE 97-72)
• Solution (e.g., library) already exists
  – http://www.mcs.anl.gov/mpi/libraries.html
• Require fault tolerance
  – Sockets
• Distributed computing
  – CORBA, DCOM, etc.
Synchronization
• Messages serve two purposes:
  – Moving data
  – Synchronizing processes
• When does synchronization occur for
  – Blocking send?
  – Non-blocking send?
    » Depends on the implementation
Common Synchronization Constructs
• Barrier
  – All processes stop; no information is passed
• Broadcast
  – All processes get information from the root
    • How does this differ from a barrier?
    • Ideas on how to implement an MPI_BCAST? (one approach is sketched below)
• Reduction
  – Data is reduced (summed, multiplied, counted, …)
    • Ideas on how to implement MPI_REDUCE?
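One possible approach, sketched with point-to-point calls in a binomial tree and assuming root 0 (real MPI_BCAST implementations differ and are much more heavily tuned):

    /* In round k, every process that already holds the data forwards it to the
       process 2^k ranks away, so the broadcast finishes in about log2(P) rounds. */
    void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        int rank, size, mask;
        MPI_Status st;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        for (mask = 1; mask < size; mask <<= 1) {
            if (rank < mask) {
                if (rank + mask < size)
                    MPI_Send(buf, count, type, rank + mask, 0, comm);   /* forward the data */
            } else if (rank < 2 * mask) {
                MPI_Recv(buf, count, type, rank - mask, 0, comm, &st);  /* receive it once  */
            }
        }
    }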
Synchronization Trees
• The topology of the messages that implement a reduction is important.
• The LogP lecture illustrated a tree that was neither linear nor binary
  – The major idea was to maintain concurrency
• MPI leaves it unspecified whether many of its collective operations synchronize, so implementers have a choice in how to optimize them