Programming Clusters using Message-Passing Interface (MPI)

Programming Clusters using Message-Passing Interface (MPI) Dr. Rajkumar Buyya Cloud Computing and Distributed Systems (CLOUDS) Laboratory The University of Melbourne Melbourne, Australia www.cloudbus.org


Page 1: Programming Clusters using Message-Passing Interface (MPI)

Programming Clusters using Message-Passing Interface (MPI)

Dr. Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory
The University of Melbourne, Melbourne, Australia
www.cloudbus.org

Page 2: Programming Clusters using Message-Passing Interface (MPI)

Outline

Introduction to Message Passing Environments

HelloWorld MPI Program

Compiling and Running MPI Programs (on interactive clusters and batch clusters)

Elements of the Hello World Program

MPI Routines Listing

Communication in MPI Programs

Summary

Page 3: Programming Clusters using Message-Passing Interface (MPI)

Message-Passing Programming Paradigm

Each processor in a message-passing program runs a sub-program written in a conventional sequential language. All variables are private to that sub-program; processes communicate via special subroutine calls.

[Figure: a cluster of nodes, each pairing a processor (P) with its own memory (M), connected by an interconnection network]

Page 4: Programming Clusters using Message-Passing Interface (MPI)

SPMD: A dominant paradigm for writing data parallel applications

main(int argc, char **argv)
{
    if (process is assigned Master role) {
        /* assign work, coordinate workers, and collect results */
        MasterRoutine(/* arguments */);
    } else {
        /* worker process: interact with master and other workers,
           do the work, and send results to the master */
        WorkerRoutine(/* arguments */);
    }
}
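The pseudocode above can be fleshed out into a compilable MPI program. This is an illustrative sketch, not from the slides: rank 0 is assumed to take the master role, and MasterRoutine/WorkerRoutine are placeholder names carried over from the pseudocode.

```c
/* spmd.c -- minimal SPMD master/worker sketch (illustrative).
   Build: mpicc spmd.c -o spmd   Run: mpirun -np 4 ./spmd */
#include <mpi.h>
#include <stdio.h>

static void MasterRoutine(int numtasks)
{
    /* assign work, coordinate workers, collect results */
    printf("Master coordinating %d worker(s)\n", numtasks - 1);
}

static void WorkerRoutine(int rank)
{
    /* interact with the master, do the work, send results back */
    printf("Worker %d doing its share of the work\n", rank);
}

int main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)          /* rank 0 takes the Master role */
        MasterRoutine(numtasks);
    else                    /* all other ranks are workers */
        WorkerRoutine(rank);

    MPI_Finalize();
    return 0;
}
```

Note that the same program text runs on every process; only the branch on the rank differs, which is exactly the SPMD pattern.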

Page 5: Programming Clusters using Message-Passing Interface (MPI)

Messages

Messages are packets of data moving between sub-programs. The message-passing system has to be told the following information:

Sending processor
Source location
Data type
Data length
Receiving processor(s)
Destination location
Destination size

Page 6: Programming Clusters using Message-Passing Interface (MPI)

Messages

Access: each sub-program needs to be connected to a message-passing system.

Addressing: messages need to have addresses to be sent to.

Reception: it is important that the receiving process is capable of dealing with the messages it is sent.

A message-passing system is similar to: a post office, phone line, fax, e-mail, etc.

Message types: point-to-point, collective, synchronous (telephone) / asynchronous (postal).

Page 7: Programming Clusters using Message-Passing Interface (MPI)

Message Passing Systems and MPI

www.mpi-forum.org

Initially, each manufacturer developed its own message-passing interface, with a wide range of features that were often incompatible. The MPI Forum brought together several vendors and users of HPC systems from the US and Europe to overcome these limitations.

It produced a document defining a standard, called the Message Passing Interface (MPI), derived from the common features and issues addressed by many message-passing libraries. It aimed:

to provide source-code portability
to allow efficient implementation
to provide a high level of functionality
to support heterogeneous parallel architectures
to support parallel I/O (in MPI 2.0)

MPI 1.0 contains over 115 routines/functions.

Page 8: Programming Clusters using Message-Passing Interface (MPI)

General MPI Program Structure

MPI Include File

Initialise MPI Environment

Do work and perform message communication

Terminate MPI Environment

Page 9: Programming Clusters using Message-Passing Interface (MPI)

MPI programs

MPI is a library: there are NO language changes.

Header files, C: #include <mpi.h>

MPI function format, C: error = MPI_Xxxx(parameter, ...);
or, ignoring the returned error code: MPI_Xxxx(parameter, ...);

Page 10: Programming Clusters using Message-Passing Interface (MPI)

Example - C

#include <mpi.h>
/* include other usual header files */

int main(int argc, char **argv)
{
    /* initialize MPI */
    MPI_Init(&argc, &argv);

    /* main part of program */

    /* terminate MPI */
    MPI_Finalize();
    return 0;
}

Page 11: Programming Clusters using Message-Passing Interface (MPI)

MPI helloworld.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello World from process %d of %d\n", rank, numtasks);

    MPI_Finalize();
    return 0;
}

Page 12: Programming Clusters using Message-Passing Interface (MPI)

MPI Programs Compilation and Execution

Page 13: Programming Clusters using Message-Passing Interface (MPI)

Compile and Run Commands (LAM MPI)

Compile:
> mpicc helloworld.c -o helloworld

Run (hosts are picked from the configuration file):
> lamboot machines.list
> mpirun -np 3 helloworld

The -np 3 option gives the number of processes. The file machines.list contains the node list:
manjra.cs.mu.oz.au
node1
node2
..
node6
node13

Page 14: Programming Clusters using Message-Passing Interface (MPI)

Sample Run and Output

A run with 3 processes:
> lamboot
> mpirun -np 3 helloworld
Hello World from process 0 of 3
Hello World from process 1 of 3
Hello World from process 2 of 3

Page 15: Programming Clusters using Message-Passing Interface (MPI)

Sample Run and Output

A run with 6 processes:
> lamboot
> mpirun -np 6 helloworld
Hello World from process 0 of 6
Hello World from process 3 of 6
Hello World from process 1 of 6
Hello World from process 5 of 6
Hello World from process 4 of 6
Hello World from process 2 of 6

Note: process output need not appear in process-number order.

Page 16: Programming Clusters using Message-Passing Interface (MPI)

Sample Run and Output

A run with 6 processes:
> lamboot
> mpirun -np 6 helloworld
Hello World from process 0 of 6
Hello World from process 3 of 6
Hello World from process 1 of 6
Hello World from process 2 of 6
Hello World from process 5 of 6
Hello World from process 4 of 6

Note the change in process output order. The process-to-machine mapping can differ from run to run, and the machines may carry different loads; hence the difference between runs.

Page 17: Programming Clusters using Message-Passing Interface (MPI)

More on MPI Program Elements and Error Checking

Page 18: Programming Clusters using Message-Passing Interface (MPI)

Initializing MPI

The first MPI routine called in any MPI program must be MPI_Init. The C version accepts the arguments to main:

int MPI_Init(int *argc, char ***argv);

MPI_Init must be called by every MPI program. Making multiple MPI_Init calls is erroneous.

Page 19: Programming Clusters using Message-Passing Interface (MPI)

MPI_COMM_WORLD

MPI_INIT defines a communicator called MPI_COMM_WORLD for every process that calls it

All MPI communication calls require a communicator argument

MPI processes can only communicate if they share a communicator.

A communicator contains a group which is a list of processes

Each process has its rank within the communicator

A process can have several communicators

Page 20: Programming Clusters using Message-Passing Interface (MPI)

MPI_COMM_WORLD

[Figure: MPI_COMM_WORLD containing 20 processes with ranks 0-19; two smaller user-created communicators overlap it, each renumbering its member processes with ranks starting from 0]

Page 21: Programming Clusters using Message-Passing Interface (MPI)

Communicators

MPI uses objects called communicators that define which collection of processes may communicate with each other.

Every process has a unique integer identifier, or rank, assigned by the system when the process initialises. A rank is sometimes called a process ID.

Processes can request information from a communicator:

MPI_Comm_rank(MPI_Comm comm, int *rank)
Returns the rank of the calling process in comm

MPI_Comm_size(MPI_Comm comm, int *size)
Returns the size of the group in comm
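As a concrete illustration of a process holding more than one communicator, the sketch below (not from the slides) uses MPI_Comm_split to partition MPI_COMM_WORLD into even- and odd-ranked groups; each process then belongs to both MPI_COMM_WORLD and its sub-communicator, with a separate rank in each.

```c
/* split.c -- create user communicators by splitting MPI_COMM_WORLD.
   Build: mpicc split.c -o split   Run: mpirun -np 4 ./split */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* processes with the same colour (here: rank parity) end up
       in the same new communicator; the key orders ranks within it */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);   /* rank within the new group */

    printf("World rank %d has rank %d in its sub-communicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
```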

Page 22: Programming Clusters using Message-Passing Interface (MPI)

Finishing up

An MPI program should call MPI_Finalize when all communications have completed. Once called, no other MPI calls can be made.

Aborting: MPI_Abort(comm, errorcode) attempts to abort all processes listed in comm. If comm = MPI_COMM_WORLD, the whole program terminates.

Page 23: Programming Clusters using Message-Passing Interface (MPI)

Hello World with Error Check

Page 24: Programming Clusters using Message-Passing Interface (MPI)

Display Hostname of MPI Process

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank;
    int resultlen;
    static char mpi_hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(mpi_hostname, &resultlen);

    printf("Hello World from process %d of %d running on %s\n",
           rank, numtasks, mpi_hostname);

    MPI_Finalize();
    return 0;
}

Page 25: Programming Clusters using Message-Passing Interface (MPI)

MPI Routines

Page 26: Programming Clusters using Message-Passing Interface (MPI)

MPI Routines – C and Fortran

Environment Management
Point-to-Point Communication
Collective Communication
Process Group Management
Communicators
Derived Types
Virtual Topologies
Miscellaneous Routines

Page 27: Programming Clusters using Message-Passing Interface (MPI)

Environment Management Routines

Page 28: Programming Clusters using Message-Passing Interface (MPI)

Point-to-Point Communication

The simplest form of message passing: one process sends a message to another. There are several variations on how sending a message can interact with execution of the sub-program.

Page 29: Programming Clusters using Message-Passing Interface (MPI)

Point-to-Point variations

Synchronous sends provide information about the completion of the message, e.g. fax machines.

Asynchronous sends only know when the message has left, e.g. postcards.

Blocking operations only return from the call when the operation has completed.

Non-blocking operations return straight away; you can test/wait later for completion.

Page 30: Programming Clusters using Message-Passing Interface (MPI)

Point-to-Point Communication

Page 31: Programming Clusters using Message-Passing Interface (MPI)

Collective Communications

Collective communication routines are higher-level routines involving several processes at a time. They can be built out of point-to-point communications.

Barriers: synchronise processes

Broadcast: one-to-many communication

Reduction operations: combine data from several processes to produce a single result (usually at one process)
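A minimal sketch (not from the slides) combining a broadcast with a reduction: the root broadcasts a value to all processes, and a reduction then sums each process's contribution back at the root.

```c
/* collective.c -- broadcast a value, then combine results with a reduction.
   Build: mpicc collective.c -o collective   Run: mpirun -np 4 ./collective */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank, n = 0, contribution, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 100;                        /* value known only to the root */

    /* one-to-many: after this call every process has n */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* many-to-one: sum each rank's contribution at the root (rank 0) */
    contribution = rank * n;
    MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of contributions: %d\n", sum);

    MPI_Finalize();
    return 0;
}
```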

Page 32: Programming Clusters using Message-Passing Interface (MPI)

Collective Communication Routines

Page 33: Programming Clusters using Message-Passing Interface (MPI)

Process Group Management Routines

Page 34: Programming Clusters using Message-Passing Interface (MPI)

Communicators Routines

Page 35: Programming Clusters using Message-Passing Interface (MPI)

Derived Type Routines

Page 36: Programming Clusters using Message-Passing Interface (MPI)

Virtual Topologies Routines

Page 37: Programming Clusters using Message-Passing Interface (MPI)

Miscellaneous Routines

Page 38: Programming Clusters using Message-Passing Interface (MPI)

MPI Communication Routines and Examples

Page 39: Programming Clusters using Message-Passing Interface (MPI)

MPI Messages

A message contains a number of elements of some particular data type.

MPI data types: basic types and derived types. Derived types can be built up from basic types.

C types are different from Fortran types.

Page 40: Programming Clusters using Message-Passing Interface (MPI)

MPI Basic Data types - C

MPI datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              (no C equivalent)
MPI_PACKED            (no C equivalent)

Page 41: Programming Clusters using Message-Passing Interface (MPI)

Point-to-Point Communication

Communication between two processes: a source process sends a message to a destination process. Communication takes place within a communicator; the destination process is identified by its rank in the communicator.

MPI provides four communication modes for sending messages: standard, synchronous, buffered, and ready. There is only one mode for receiving.

Page 42: Programming Clusters using Message-Passing Interface (MPI)

Standard Send

Completes once the message has been sent. Note: it may or may not have been received.

Programs should obey the following rules:

They should not assume the send will complete before the receive begins; that can lead to deadlock.

They should not assume the send will complete after the receive begins; that can lead to non-determinism.

Processes should be eager readers: they should guarantee to receive all messages sent to them, else the network overloads.

Can be implemented as either a buffered send or a synchronous send.

Page 43: Programming Clusters using Message-Passing Interface (MPI)

Standard Send (cont.)

MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

buf       the address of the data to be sent
count     the number of elements of datatype that buf contains
datatype  the MPI datatype
dest      rank of the destination in communicator comm
tag       a marker used to distinguish different message types
comm      the communicator shared by sender and receiver
ierror    the Fortran return value of the send (in C, the error code is the function's return value)

Page 44: Programming Clusters using Message-Passing Interface (MPI)

Standard Blocking Receive

Note: all sends so far have been blocking (but this only makes a difference for synchronous sends).

Completes when the message is received:
MPI_Recv(buf, count, datatype, source, tag, comm, status)

source    rank of the source process in communicator comm
status    returns information about the message

Synchronous blocking message-passing: the processes synchronise; the sender process specifies the synchronous mode; blocking means both processes wait until the transaction is completed.

Page 45: Programming Clusters using Message-Passing Interface (MPI)

For a communication to succeed

The sender must specify a valid destination rank
The receiver must specify a valid source rank
The communicator must be the same
Tags must match
Message types must match
The receiver's buffer must be large enough

The receiver can use wildcards:
MPI_ANY_SOURCE
MPI_ANY_TAG
The actual source and tag are returned in the status parameter.
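The wildcard receive can be sketched as follows (illustrative, not from the slides): rank 0 accepts one message from each of the other ranks in whatever order they arrive, using MPI_ANY_SOURCE and MPI_ANY_TAG, then reads the actual source and tag out of the status parameter.

```c
/* wildcard.c -- receive from any source/tag and inspect MPI_Status.
   Build: mpicc wildcard.c -o wildcard   Run: mpirun -np 4 ./wildcard */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int msg, i;
        MPI_Status status;
        for (i = 1; i < numtasks; i++) {
            /* accept messages in whatever order they arrive */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d from rank %d (tag %d)\n",
                   msg, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        int msg = rank * 10;
        /* use the rank as the tag, purely for illustration */
        MPI_Send(&msg, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```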

Page 46: Programming Clusters using Message-Passing Interface (MPI)

Standard/Blocked Send/Receive

Page 47: Programming Clusters using Message-Passing Interface (MPI)

Standard/Blocked Send/Receive

Page 48: Programming Clusters using Message-Passing Interface (MPI)

MPI Send/Receive a Character (cont...)

// mpi_com.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char inmsg, outmsg = 'X';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        printf("Rank0 sent: %c\n", outmsg);
        source = 1;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }

Page 49: Programming Clusters using Message-Passing Interface (MPI)

MPI Send/Receive a Character (cont...)

    else if (rank == 1) {
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %c\n", inmsg);
        dest = 0;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Page 50: Programming Clusters using Message-Passing Interface (MPI)

Execution Demo

> mpicc mpi_com.c
> lamboot
> mpirun -np 2 ./a.out

Rank0 sent: X
Rank0 recv: Y
Rank1 received: X

Page 51: Programming Clusters using Message-Passing Interface (MPI)

Non Blocking Message Passing

Page 52: Programming Clusters using Message-Passing Interface (MPI)

Non Blocking Message Passing

Functions to check on completion: MPI_Wait, MPI_Test, MPI_Waitany, MPI_Testany, MPI_Waitall, MPI_Testall, MPI_Waitsome, MPI_Testsome.

MPI_Status status;

MPI_Wait(&request, &status)          /* blocks */

MPI_Test(&request, &flag, &status)   /* doesn't block */
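A minimal sketch (not from the slides) of the non-blocking calls in use: two processes exchange an integer with MPI_Irecv/MPI_Isend, could do useful work in between, and then block in MPI_Waitall until both transfers complete. It assumes exactly two processes (mpirun -np 2).

```c
/* nonblock.c -- overlap communication with work using MPI_Isend/MPI_Irecv.
   Build: mpicc nonblock.c -o nonblock   Run: mpirun -np 2 ./nonblock */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, other, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other = 1 - rank;        /* the partner rank; assumes two processes */
    sendval = rank + 100;

    /* both calls return immediately; the transfers proceed in background */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation could go here ... */

    MPI_Waitall(2, reqs, stats);     /* block until both complete */
    printf("Rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}
```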

Page 53: Programming Clusters using Message-Passing Interface (MPI)

Timers

C: double MPI_Wtime(void);

Returns elapsed wall-clock time in seconds (double precision) on the calling processor.

Time is measured in seconds. The time to perform a task is measured by consulting the timer before and after.
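A minimal timing sketch (not from the slides): consult MPI_Wtime before and after a task and take the difference. The loop here is just a stand-in for real work.

```c
/* timing.c -- measure elapsed time around a task with MPI_Wtime.
   Build: mpicc timing.c -o timing   Run: mpirun -np 1 ./timing */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    double t0, t1, x = 0.0;
    long i;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();                 /* wall-clock time before the task */
    for (i = 0; i < 10000000L; i++)
        x += 1.0 / (double)(i + 1);   /* stand-in for real work */
    t1 = MPI_Wtime();                 /* wall-clock time after the task */

    printf("Task took %f seconds (result %f)\n", t1 - t0, x);

    MPI_Finalize();
    return 0;
}
```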