CSE 160 – Lecture 16
MPI Concepts, Topology and Synchronization
So Far …
• We have concentrated on generic message passing algorithms
• We have used PVM to implement codes (simpler on workstations)
• MPI is a de facto standard for message passing and introduces some important concepts
Message Addressing
• Identify an endpoint
• Use a tag to distinguish a particular message
  – pvm_send(dest, tag)
  – MPI_SEND(buf, len, type, dest, tag, COMM)
• Receiving
  – recv(src, tag); recv(*, tag); recv(src, *); recv(*, *)
• What if you want to build a library that uses message passing? Is (src, tag) safe in all instances?
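As a sketch of the four receive patterns above in MPI's C binding (buf, len, src, tag, and comm here are placeholder variables, not names from the slide):

    MPI_Status st;
    MPI_Recv(buf, len, MPI_DOUBLE, src,            tag,         comm, &st);  /* recv(src, tag) */
    MPI_Recv(buf, len, MPI_DOUBLE, MPI_ANY_SOURCE, tag,         comm, &st);  /* recv(*, tag)   */
    MPI_Recv(buf, len, MPI_DOUBLE, src,            MPI_ANY_TAG, comm, &st);  /* recv(src, *)   */
    MPI_Recv(buf, len, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &st);  /* recv(*, *)     */
    /* st.MPI_SOURCE and st.MPI_TAG report who actually sent the matched message */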
Level 0 Issues
• Basic Pt-2-Pt message passing is straightforward, but how does one …
  – Make it go fast
    • Eliminate extra memory copies
    • Take advantage of specialized hardware
  – Move complex data structures (packing)
  – Receive from one-of-many (wildcarding)
  – Synchronize a group of tasks
  – Recover from errors
  – Start tasks
  – Build safe libraries
  – Monitor tasks
  – …
MPI-1 addresses many of the level 0 issues
(but not all)
A long history of research efforts in message passing
• P4
• Chameleon
• Parmacs
• TCGMSG
• CHIMP
• NX (Intel i860, Paragon)
• PVM
• …
And these begot MPI
So What is MPI
• It is a standard message passing API
  – Specifies many variants of send/recv
    • 9 send interface calls, e.g., synchronous send, asynchronous send, ready send, asynchronous ready send
  – Plus other defined APIs
    • Process topologies
    • Group operations
    • Derived datatypes
    • Profiling API (standard way to instrument MPI code)
  – Implemented and optimized by machine vendors
  – Should you use it? Absolutely!
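For concreteness, a sampling of the send variants by their standard C names (not the complete set of nine; buf, count, type, dest, tag, comm, request, and status are placeholders):

    MPI_Send (buf, count, type, dest, tag, comm);            /* standard send                       */
    MPI_Ssend(buf, count, type, dest, tag, comm);            /* synchronous send                    */
    MPI_Bsend(buf, count, type, dest, tag, comm);            /* buffered send (needs MPI_Buffer_attach) */
    MPI_Rsend(buf, count, type, dest, tag, comm);            /* ready send                          */
    MPI_Isend(buf, count, type, dest, tag, comm, &request);  /* nonblocking (immediate) send        */
    MPI_Wait(&request, &status);                             /* complete a nonblocking operation    */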
So What’s Missing in MPI-1?
• Process control
  – How do you start 100 tasks?
  – How do you kill/signal/monitor remote tasks?
• I/O– Addressed in MPI-2
• Fault-tolerance– One MPI process dies, the rest eventually hang
• Interoperability
  – No standard for running a single parallel job across architectures (e.g., a computation cannot be split between x86 Linux and Alpha)
Communication Envelope
• (src, tag) is enough to distinguish messages; however, there is a problem:
  – How does one write safe libraries that also use message passing?
    • Tags used in libraries can be the same as tags used in the calling code, so messages could get confused
• Message envelope – add another, system-assigned tag to a message
  – This label defines a message envelope
Envelopes
[Figure: two envelopes, each holding messages labeled (Tag 100, Src A), (Tag 100, Src B), (Tag 200, Src A)]
• Context tag – messages are received relative to a context
Communicators
• Messages ride inside of message envelopes and receives can only match candidate messages relative to an envelope.
• MPI implements these envelopes as communicators.
• A communicator is more than just an envelope – it also encompasses a group of processes
• Communicator = context + group of tasks
MPI_COMM_WORLD
• MPI programs are static – they can never grow or shrink
• There is one default communicator called MPI_COMM_WORLD
• All processes are members of COMM_WORLD
• Each process is labeled 0 – N for a communicator of size N+1
  – Similar to a PVM group
How do you build new groups?
• All new communicators are derived from existing communicators
  – All communicators have MPI_COMM_WORLD as an "ancestor"
  – New communicators are formed synchronously
    • All members of a new communicator call the construction routine at the same time
• MPI_COMM_DUP(), MPI_COMM_SPLIT()
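A minimal sketch of MPI_COMM_SPLIT, assuming the goal is to divide MPI_COMM_WORLD into even-rank and odd-rank sub-communicators:

    /* Collective call: every process in the parent communicator must participate. */
    int world_rank;
    MPI_Comm subcomm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_split(MPI_COMM_WORLD,
                   world_rank % 2,   /* color: processes with the same color share a new communicator */
                   world_rank,       /* key: orders ranks within the new communicator                 */
                   &subcomm);
    /* Each process now has a small local rank (starting at 0) inside its own subcomm. */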
Communicators as groups
[Figure: six processes ranked 0–5 in a parent communicator, split into two new communicators of three processes each; every process receives a new local rank 0–2 within its sub-communicator]
Running MPI Programs
• The MPI-1 standard does not specify how to run an MPI program, just as the Fortran standard does not specify how to run a Fortran program.
• In general, starting an MPI program depends on the implementation of MPI you are using, and might require various scripts, program arguments, and/or environment variables.
• mpiexec <args> is part of MPI-2, as a recommendation, but not a requirement
  – You can use mpiexec for MPICH and mpirun for other systems
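For illustration only (launchers and flags vary by installation; my_mpi_program is a placeholder name):

    mpiexec -n 4 ./my_mpi_program     # MPI-2 recommended launcher (e.g., MPICH)
    mpirun -np 4 ./my_mpi_program     # common alternative on other systems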
Finding Out About the Environment
• Two important questions that arise early in a parallel program are:
  – How many processes are participating in this computation?
  – Which one am I?
• MPI provides functions to answer these questions:
  – MPI_Comm_size reports the number of processes
  – MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
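A minimal, self-contained sketch using both calls (the output format is only illustrative):

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int size, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are participating? */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? (0 .. size-1)          */
        printf("I am process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }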
MPI Datatypes
• The data in a message to be sent or received is described by a triple (address, count, datatype), where
• An MPI datatype is recursively defined as:
  – predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
  – a contiguous array of MPI datatypes
  – a strided block of datatypes
  – an indexed array of blocks of datatypes
  – an arbitrary structure of datatypes
• There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise.
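As one hedged example, a derived datatype for a row of an R x C matrix of doubles stored column-major could be built with MPI_Type_vector (R, C, a, dest, and tag are placeholder names):

    /* Consecutive elements of one row are R doubles apart in column-major storage. */
    MPI_Datatype rowtype;
    MPI_Type_vector(C,            /* count: number of blocks (one per column)  */
                    1,            /* blocklength: one element per block        */
                    R,            /* stride between blocks, in elements        */
                    MPI_DOUBLE, &rowtype);
    MPI_Type_commit(&rowtype);
    /* Send row i by pointing at its first element a[i]: */
    MPI_Send(&a[i], 1, rowtype, dest, tag, MPI_COMM_WORLD);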
MPI Tags
• Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
• Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.
• Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes.
Why Datatypes?
• Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).
• Specifying application-oriented layout of data in memory
  – reduces memory-to-memory copies in the implementation
  – allows the use of special hardware (scatter/gather) when available
• Theoretically simplifies the programmer’s life
• How would you build "datatypes" in PVM? – with sequences of pvm_pack() calls (see the sketch below)
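For comparison, a sketch of the PVM approach (assumes pvm3.h is included; n, vec, dest, and tag are placeholder variables):

    pvm_initsend(PvmDataDefault);    /* start a new send buffer                */
    pvm_pkint(&n, 1, 1);             /* pack one int                           */
    pvm_pkdouble(vec, n, 1);         /* pack n doubles, stride 1               */
    pvm_send(dest, tag);             /* ship the packed buffer to task `dest`  */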
Introduction to Collective Operations in MPI
• Collective operations are called by all processes in a communicator.
• MPI_BCAST distributes data from one process (the root) to all others in a communicator.
  – How does this differ from pvm_mcast() and pvm_bcast()?
• MPI_REDUCE combines data from all processes in communicator and returns it to one process.
• In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.
Example: PI in C - 1

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>
    int main(int argc, char *argv[])
    {
        int done = 0, n, myid, numprocs, i;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        while (!done) {
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;

Example: PI in C - 2

            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
        MPI_Finalize();
        return 0;
    }
Sources of Deadlocks
• Send a large message from process 0 to process 1
  – If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
• What happens with:
    Process 0: Send(1); Recv(1)
    Process 1: Send(0); Recv(0)
• This is called "unsafe" because it depends on the availability of system buffers (PVM does buffering)
Some Solutions to the "unsafe" Problem
• Order the operations more carefully:
    Process 0: Send(1); Recv(1)
    Process 1: Recv(0); Send(0)
• Use non-blocking operations:
    Process 0: Isend(1); Irecv(1); Waitall
    Process 1: Isend(0); Irecv(0); Waitall
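A sketch of the non-blocking exchange in MPI's C binding (sendbuf, recvbuf, count, tag, and other are placeholders; `other` is the partner's rank, 1 on process 0 and 0 on process 1):

    /* Post both operations, then wait for all of them; neither post blocks. */
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Isend(sendbuf, count, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);   /* completion happens here, regardless of buffering */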
Extending the Message-Passing Interface
• Dynamic process management
  – Dynamic process startup
  – Dynamic establishment of connections
• One-sided communication
  – Put/get
  – Other operations
• Parallel I/O
• Other MPI-2 features
  – Generalized requests
  – Bindings for C++ / Fortran-90; interlanguage issues
When to use MPI
• Portability and Performance
• Irregular Data Structures
• Building Tools for Others
  – Libraries
• Need to manage memory on a per-processor basis
When not to use MPI
• Regular computation matches HPF
  – But see the PETSc/HPF comparison (ICASE 97-72)
• Solution (e.g., library) already exists
  – http://www.mcs.anl.gov/mpi/libraries.html
• Require fault tolerance
  – Sockets
• Distributed computing
  – CORBA, DCOM, etc.
Synchronization
• Messages serve two purposes:
  – Moving data
  – Synchronizing processes
• When does synchronization occur for
  – Blocking send?
  – Non-blocking send?
    » Depends on the implementation
Common Synchronization Constructs
• Barrier
  – All processes stop; no information is passed
• Broadcast
  – All processes get information from the root
    • How does this differ from a barrier?
    • Ideas on how to implement an MPI_BCAST? (one approach is sketched below)
• Reduction
  – Data is reduced (summed, multiplied, counted, …)
    • Ideas on how to implement MPI_REDUCE?
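One possible approach, sketched with point-to-point calls in a binomial tree and assuming root 0 (real MPI_BCAST implementations differ and are much more heavily tuned):

    /* In round k, every process that already holds the data forwards it to the
       process 2^k ranks away, so the broadcast finishes in about log2(P) rounds. */
    void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        int rank, size, mask;
        MPI_Status st;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        for (mask = 1; mask < size; mask <<= 1) {
            if (rank < mask) {
                if (rank + mask < size)
                    MPI_Send(buf, count, type, rank + mask, 0, comm);   /* forward the data */
            } else if (rank < 2 * mask) {
                MPI_Recv(buf, count, type, rank - mask, 0, comm, &st);  /* receive it once  */
            }
        }
    }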
Synchronization Trees
• The topology of the messages that implement a reduction is important.
• The LogP lecture illustrated a tree that was neither linear nor binary
  – The major idea was to maintain concurrency
• MPI leaves it unspecified whether many of its collective operations synchronize, so implementers have a choice in how to optimize them