ISBI MPI Tutorial
MPI & Distributed Computing
Eric Borisch, M.S., Mayo Clinic


Page 1: ISBI MPI Tutorial

MPI & Distributed Computing

Eric Borisch, M.S., Mayo Clinic

Page 2: ISBI MPI Tutorial

Topics

Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up

Page 3: ISBI MPI Tutorial

Shared vs. Distributed Memory

Shared Memory: all memory within a system is directly addressable (ignoring access restrictions) by each process [or thread]
- Single- and multi-CPU desktops & laptops
- Multi-threaded apps
- GPGPU *
- MPI *

Distributed Memory: memory available to a given node within a system is unique and distinct from its peers
- MPI
- Google MapReduce / Hadoop

Page 4: ISBI MPI Tutorial

Why bother?

[Figure: STREAM benchmark (http://www.cs.virginia.edu/stream/) relative performance of Copy/Scale/Add/Triad vs. number of processes (1-8). System: CentOS 5.2; dual quad-core 3GHz P4 [E5472]; DDR2 800MHz.]

Page 5: ISBI MPI Tutorial

But what about Nehalem?

[Figure: STREAM benchmark OpenMP performance (http://www.cs.virginia.edu/stream/): Add/Copy/Scale/Triad relative performance (0%-400%) vs. number of threads (0-16; 8 physical cores + HT). System: 2x X5570 (2.93GHz; quad-core; 6.4GT/s QPI); 12x4G 1033 DDR3.]

Page 6: ISBI MPI Tutorial

Memory Limitations

- Bandwidth (FSB, HT, Nehalem, CUDA, …)
  - Frequently run into with high-level languages (MATLAB)
- Capacity – cost & availability
  - High-density chips are $$$ (if even available)
  - Memory limits on individual systems
- Distributed computing addresses both bandwidth and capacity with multiple systems
- MPI is the glue used to connect multiple distributed processes together

Page 7: ISBI MPI Tutorial

Memory Requirements [Example]

Custom iterative SENSE reconstruction: 3 x 8 coils x 400 x 320 x 176 x 8 [complex float]
- Profile data (img space)
- Estimate (img <-> k space)
- Acquired data (k space)
> 4GB data touched during each iteration
16-, 32-channel data here or on the way…

Trzasko, Josh. ISBI 2009 #1349, "Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms." M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro.

Page 8: ISBI MPI Tutorial

[Figure: Real-time SENSE unfolding pipeline. Real-time DATA and pre-loaded CAL data feed the processing stages: place each view into the correct x-Ky-Kz space (AP & LP), FTx, FTyz (AP & LP), "traditional" 2D SENSE unfold (AP & LP), homodyne correction, GW correction (Y, Z), GW correction (X), MIP, RESULT, store / DICOM. The diagram marks which stages run on the root node vs. the worker nodes and where MPI communication occurs.]

Page 9: ISBI MPI Tutorial

[Figure: MRI Reconstruction Cluster. Root node: 2x 3.6GHz P4, 16GB RAM, 80GB HDD, 1Gb Ethernet, 2x 8Gb Infiniband. Worker nodes (x7): 2x 3.6GHz P4, 16GB RAM, 80GB HDD, 2x 8Gb Infiniband. A 24-port Infiniband switch carries the MPI interconnects (x2 per node, 8Gb/s per connection, 16Gb/s bandwidth per node); a 16-port Gigabit Ethernet switch carries the x7 file-system connections over 1Gb Ethernet. The root node also connects to the MRI system and the site intranet over 1Gb Ethernet. A 500GB HDD is also shown. The key distinguishes cluster hardware from external hardware, and 2x8Gb Infiniband connections from 1Gb Ethernet connections.]

Page 10: ISBI MPI Tutorial

Many Approaches to “Distributed”

- Loosely coupled ("grid computing"): SETI / BOINC
- BIOS-level abstraction: ScaleMP
- Tightly coupled ("cluster computing"): MPI
- Hybrid: Folding@Home, gpugrid.net

Image credits: http://en.wikipedia.org/wiki/File:After_Dark_Flying_Toasters.png, http://en.wikipedia.org/wiki/File:Setiathomeversion3point08.png

Page 11: ISBI MPI Tutorial

Grid vs. Cluster

[Figure: Grid vs. cluster topology. Grid: a master coordinating independent workers. Cluster: a head node plus worker nodes joined by a dedicated interconnect.]

Page 12: ISBI MPI Tutorial

Shared vs. Distributed

[Figure: Shared vs. distributed. Shared memory: one host/OS runs process A (threads 1 through N) and process B, which exchange data through memory transfers. Distributed memory: hosts I, II, …, N, each with its own OS, run processes A, B, …, C and exchange data through network transfers.]

Page 13: ISBI MPI Tutorial

Shared vs. Distributed

[Figure: Shared vs. distributed, continued. The same picture as the previous slide, but each distributed host now runs multiple processes (A & D, B & E, …, C & F): processes on the same host communicate via memory transfers, while processes on different hosts communicate via network transfers.]

Page 14: ISBI MPI Tutorial

Topics

Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up

Page 15: ISBI MPI Tutorial

MPI

Message Passing Interface is…
- "a library specification for message-passing" [1]
- Available in many implementations on multiple platforms *
- A set of functions for moving messages between different processes without a shared memory environment
- Low-level*; no concept of overall computing tasks to be performed

[1] http://www.mcs.anl.gov/research/projects/mpi/

Page 16: ISBI MPI Tutorial

MPI history

MPI-1
- Version 1.0 draft standard, 1994
- Version 1.1 in 1995
- Version 1.2 in 1997
- Version 1.3 in 2008

MPI-2
- Added: 1-sided communication; dynamic "world" sizes (spawn / join)
- Version 2.0 in 1997
- Version 2.1 in 2008

MPI-3
- In process
- Enhanced fault handling

Forward compatibility preserved

Page 17: ISBI MPI Tutorial

MPI Status

MPI is the de-facto standard for distributed computing: freely available, open source implementations exist, portable, mature.

From a discussion of why MPI is dominant [1]:

"[…] 100s of languages have come and gone. Good stuff must have been created [… yet] it is broadly accepted in the field that they're not used. MPI has a lock. OpenMP is accepted, but a distant second. There are substantial barriers to the introduction of new languages and language constructs. Economic, ecosystem related, psychological, a catch-22 of widespread use, etc. Any parallel language proposal must come equipped with reasons why it will overcome those barriers."

[1] http://www.ieeetcsc.org/newsletters/2006-01/why_all_mpi_discussion.html

Page 18: ISBI MPI Tutorial

MPI Distributions

MPI itself is just a specification; we want an implementation:
- MPICH, MPICH2: widely portable
- MVAPICH, MVAPICH2: Infiniband-centric; MPICH/MPICH2 based
- OpenMPI: plug-in architecture; many run-time options
- And more: IntelMPI, HP-MPI, MPI for IBM Blue Gene, MPI for Cray, Microsoft MPI, MPI for SiCortex, MPI for Myrinet Express (MX), MPICH2 over SCTP

Page 19: ISBI MPI Tutorial

Implementing a distributed system

Without MPI:
- Start all of the processes across a bank of machines (shell scripting + ssh)
- socket(), bind(), listen(), accept() or connect() for each link
- send(), read() on individual links
- Raw byte interfaces; no discrete messages (a rough sketch of one such link follows below)
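As a rough sketch (not from the slides) of the boilerplate this implies, the accepting side of a single link might look like the following; the port number is arbitrary and all error handling is omitted.

/* Illustrative raw-socket accept side for one link; raw bytes only, no discrete messages. */
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int accept_one_link(unsigned short port)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(listener, (struct sockaddr *) &addr, sizeof(addr));   /* error checks omitted */
    listen(listener, 1);

    int link = accept(listener, NULL, NULL);                   /* wait for one peer */
    close(listener);

    /* From here on it is raw bytes: the application must invent its own message
       boundaries, byte ordering, and completion semantics. */
    return link;
}

Multiply this by every pair of processes that needs to talk, plus the scripting to launch them all, and the appeal of a library that does it for you becomes clear.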

Page 20: ISBI MPI Tutorial

Implementing a distributed system

With MPI:
- mpiexec -np <n> app
- MPI_Init()
- MPI_Send()
- MPI_Recv()
- MPI_Finalize()

MPI manages the connections, packages messages, and provides the launching mechanism (a minimal skeleton follows below).
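For comparison with the socket sketch above, here is a minimal skeleton (illustrative, not from the slides) built from just the calls listed: rank 0 passes one integer to rank 1.

/* Minimal MPI skeleton using the calls listed above (illustrative). */
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, nodes, token = 0;

    MPI_Init(&argc, &argv);                    /* MPI sets up all the connections */
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && nodes > 1)
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* one discrete message */
    else if (rank == 1)
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

Launch with, e.g., mpiexec -np 2 ./skeleton (the binary name is arbitrary).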

Page 21: ISBI MPI Tutorial

MPI (the document) [1]

Provides definitions for:
- Communication functions: MPI_Send(), MPI_Recv(), MPI_Bcast(), etc.
- Datatype management functions: MPI_Type_create_hvector(), etc.
- C, C++, and Fortran bindings

Also recommends a process startup mechanism: mpiexec -np <nproc> <program> <args>

[1] http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html

Page 22: ISBI MPI Tutorial

MPI Functions

[Slide lists the complete set of MPI function names, from MPI_Abort through MPI_Wtime: several hundred functions in all, covering point-to-point and collective communication, communicators and groups, derived datatypes, one-sided (window) operations, MPI-IO, and environment management.]

Page 23: ISBI MPI Tutorial

The message passing mindset

Each process owns its data – there is no "our"
- Makes many things simpler: no mutexes, condition variables, semaphores, etc.; memory-access-order race conditions go away

Every message is an explicit copy (see the sketch after this list)
- I have the memory I sent from; you have the memory you received into
- Even when running in a "shared memory" environment

Synchronization comes along for free
- I won't get your message (or data) until you choose to send it

Programming to MPI first can make it easier to scale out later
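To make the explicit-copy point concrete, a small hedged sketch (not from the slides): once MPI_Send() returns, the sender may freely overwrite its buffer, and the receiver still gets the value that was sent.

/* Illustrative: received data is a private copy. Run with: mpiexec -np 2 ./copies */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        value = -1;                 /* safe: the message data has already been copied out */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);   /* prints 42 */
    }

    MPI_Finalize();
    return 0;
}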

Page 24: ISBI MPI Tutorial

Topics

Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up

Page 25: ISBI MPI Tutorial

Getting started with MPI

Download / decompress the MPICH2 source: http://www.mcs.anl.gov/research/projects/mpich2/
- Supports C / C++ / Fortran
- Requires Python >= 2.2

./configure
make install

- Installs into /usr/local by default, or use --prefix=<chosen path>
- Make sure <prefix>/bin is in PATH
- Make sure <prefix>/share/man is in MANPATH

Page 26: ISBI MPI Tutorial

MPI Installation

[Figure: contents of an MPI installation: the C compiler wrapper (mpicc), the C++ compiler wrapper (mpicxx), the MPI job launcher (mpiexec), and the MPD launcher (mpdboot).]

Page 27: ISBI MPI Tutorial

MPD launch

Set up passwordless ssh to the workers
Start the daemons with mpdboot -n <N>
- Requires ~/.mpd.conf to exist on each host
  - Contains (same on each host): MPD_SECRETWORD=<some gibberish string>
  - Permissions set to 600 (r/w access for owner only)
- Requires ./mpd.hosts to list the other host names
  - Unless run as mpdboot -n 1 (run on current host only)
  - Will not accept the current host in the list (it is implicit)
Check for running daemons with mpdtrace

For details: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf

Page 28: ISBI MPI Tutorial

MPD launch

Page 29: ISBI MPI Tutorial

MPI Compile & launch

Use mpicc / mpicxx as the C/C++ compiler
- Wrapper scripts around the C/C++ compilers detected during install:

$ mpicc --show
gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt

$ mpicc -o hello hello.c

Use mpiexec -np <nproc> <app> <args> to launch:

$ mpiexec -np 4 ./hello

Page 30: ISBI MPI Tutorial

Hello, Hello, Hello, world world world

/* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i, rank, nodes;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < nodes; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank)
            printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}

$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!

Page 31: ISBI MPI Tutorial

Threads vs. MPI startup

[Figure: threaded startup. ./threaded_app starts main(); main() calls pthread_create(func()), and both main() and func() (a thread within the threaded_app process) do work against the same memory; the thread finishes via pthread_exit() / pthread_join(), and the process ends with exit().]

Page 32: ISBI MPI Tutorial

[Figure: MPI startup. mpiexec -np 4 ./mpi_app has mpd launch the jobs; each rank (mpi_app rank 0, 1, …, 3) runs its own main() and calls MPI_Init(), then alternates doing work on its local memory with MPI communication steps (MPI_Bcast(), MPI_Allreduce()), and finally calls MPI_Finalize() and exit().]

Page 33: ISBI MPI Tutorial

Hello, world: unique to ranks

/* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i;
    int rank;
    int nodes;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < nodes; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank)
            printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}

Page 34: ISBI MPI Tutorial

MPE (Multi-Process Environment)

MPICH2 comes with MPE by default (unless disabled during configure)
Multiple tracing / logging options to track MPI traffic
Enabled through -mpe=<option> at compile time:

MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
MacPro:code$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
Writing logfile....
Enabling the Default clock synchronization...
Finished writing logfile ./hello.clog2.

Page 35: ISBI MPI Tutorial

jumpshot view of log

Page 36: ISBI MPI Tutorial

Output with -mpe=mpitrace

MacPro:code$ mpicc -mpe=mpitrace -o hello hello.c
MacPro:code$ mpiexec -np 2 ./hello > trace

MacPro:code$ grep 0 trace
[0] Ending MPI_Init
[0] Starting MPI_Comm_size...
[0] Ending MPI_Comm_size
[0] Starting MPI_Comm_rank...
[0] Ending MPI_Comm_rank
[0] Starting MPI_Barrier...
[0] Ending MPI_Barrier
Hello from 0 of 2!
[0] Starting MPI_Barrier...
[0] Ending MPI_Barrier
[0] Starting MPI_Finalize...
[0] Ending MPI_Finalize

MacPro:code$ grep 1 trace
[1] Ending MPI_Init
[1] Starting MPI_Comm_size...
[1] Ending MPI_Comm_size
[1] Starting MPI_Comm_rank...
[1] Ending MPI_Comm_rank
[1] Starting MPI_Barrier...
[1] Ending MPI_Barrier
[1] Starting MPI_Barrier...
[1] Ending MPI_Barrier
Hello from 1 of 2!
[1] Starting MPI_Finalize...
[1] Ending MPI_Finalize

Page 37: ISBI MPI Tutorial

A more interesting log…

Page 38: ISBI MPI Tutorial

3D-sinc interpolation

Page 39: ISBI MPI Tutorial

MPI_Send (Blocking)

int MPI_Send(
  void *buf,              memory location to send from
  int count,              number of elements (of type datatype) at buf
  MPI_Datatype datatype,  MPI_INT, MPI_FLOAT, etc., or custom datatypes: strided vectors, structures, etc.
  int dest,               rank (within the communicator comm) of the destination for this message
  int tag,                used to distinguish this message from other messages
  MPI_Comm comm )         communicator for this transfer; often MPI_COMM_WORLD

Page 40: ISBI MPI Tutorial

MPI_Recv (Blocking)

int MPI_Recv(
  void *buf,              memory location to receive data into
  int count,              number of elements (of type datatype) available to receive into at buf
  MPI_Datatype datatype,  MPI_INT, MPI_FLOAT, etc., or custom datatypes: strided vectors, structures, etc.; typically matches the sending datatype, but doesn't have to…
  int source,             rank (within the communicator comm) of the source for this message; can also be MPI_ANY_SOURCE
  int tag,                used to distinguish this message from other messages; can also be MPI_ANY_TAG
  MPI_Comm comm,          communicator for this transfer; often MPI_COMM_WORLD
  MPI_Status *status )    structure describing the received message, including: actual count (can be smaller than the passed count), source (useful with source = MPI_ANY_SOURCE), and tag (useful with tag = MPI_ANY_TAG)

Page 41: ISBI MPI Tutorial

Another example

/* sr.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int main (int argc, char * argv[])
{
    int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
    MPI_Status sendStatus;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    myData[0] = rank;
    MPI_Send(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
    MPI_Recv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0,
             MPI_COMM_WORLD, &sendStatus);

    printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

    MPI_Finalize();
    return 0;
}

Page 42: ISBI MPI Tutorial

Does it run?

$ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

Page 43: ISBI MPI Tutorial

Log output (-np 4)

Page 44: ISBI MPI Tutorial

May != Will

$ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 2 ./sr
^C

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

Page 45: ISBI MPI Tutorial

What the standard has to say… (3.4 Communication Modes)

The send call described in Section Blocking send is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.

Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol.

The send call described in Section Blocking send used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.

Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.

http://www.mpi-forum.org/docs/mpi-11-html/node40.html#Node40

Page 46: ISBI MPI Tutorial

Rendezvous vs. eager (simplified)

[Figure: rendezvous vs. eager transfers between Process 1 and Process 2, distinguishing user activity from MPI activity. Eager: the sender transmits a "small" message immediately and returns; the receiver later requests and receives the already-delivered data. Rendezvous: for a "large" message the sender issues a rendezvous request and blocks until completion; when the receiver requests the message, the rendezvous request is matched, the rendezvous send proceeds, and the receiver receives the data.]

Page 47: ISBI MPI Tutorial

MPI communication modes

MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)
- Sends are "local" – they return independent of any remote activity
- The message buffer can be touched immediately after the call returns
- Requires a user-provided buffer, provided via MPI_Buffer_attach()
- Forces an "eager"-like message transfer from the sender's perspective
- User can wait for completion by calling MPI_Buffer_detach()
(a buffered-send sketch follows this list)

MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)
- Won't return until the matching receive is posted
- Forces a "rendezvous"-like message transfer
- Can be used to guarantee synchronization without additional MPI_Barrier() calls

MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)
- Erroneous if the matching receive has not been posted
- Performance tweak (on some systems) when the user can guarantee the matching receive is posted

MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)
- Non-blocking; immediate return once the send/receive request is posted
- Requires an MPI_[Test|Wait][|all|any|some] call to guarantee completion
- Send/receive buffers should not be touched until completed
- MPI_Request * argument used for eventual completion

The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to receive any send mode.
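As a hedged illustration of the buffered mode above (not from the slides), the ring exchange from sr.c can also be made safe by attaching a user buffer and using MPI_Bsend(); the buffer sizing here is the minimal case of a single int.

/* bsend example (illustrative): buffered send avoids the standard-mode deadlock. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, nodes, myData, theirData, bufSize;
    void * buffer;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Attach a user buffer big enough for one int plus MPI's bookkeeping. */
    bufSize = sizeof(int) + MPI_BSEND_OVERHEAD;
    buffer  = malloc(bufSize);
    MPI_Buffer_attach(buffer, bufSize);

    myData = rank;
    /* Returns as soon as the message is copied into the attached buffer,
       whether or not the matching receive has been posted. */
    MPI_Bsend(&myData, 1, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
    MPI_Recv(&theirData, 1, MPI_INT, (rank + nodes - 1) % nodes, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Detach blocks until all buffered messages have been delivered. */
    MPI_Buffer_detach(&buffer, &bufSize);
    free(buffer);

    printf("%i sent %i; received %i\n", rank, myData, theirData);

    MPI_Finalize();
    return 0;
}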

Page 48: ISBI MPI Tutorial

Fixing the code

/* sr2.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int main (int argc, char * argv[])
{
    int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
    MPI_Status xferStatus[2];
    MPI_Request xferRequest[2];

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    myData[0] = rank;
    MPI_Isend(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0,
              MPI_COMM_WORLD, &xferRequest[0]);
    MPI_Irecv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0,
              MPI_COMM_WORLD, &xferRequest[1]);

    MPI_Waitall(2, xferRequest, xferStatus);

    printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

    MPI_Finalize();
    return 0;
}

Page 49: ISBI MPI Tutorial

Fixed with MPI_I[send|recv]()

$ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 4 ./sr2
0 sent 0; received 3
2 sent 2; received 1
1 sent 1; received 0
3 sent 3; received 2

Page 50: ISBI MPI Tutorial

Topics

Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up

Page 51: ISBI MPI Tutorial

Types of parallelism [1/2]

Task parallelism
- Each process handles a unique kind of task
  - Example: multi-image uploader (with resize/recompress)
    - Thread 1: GUI / user interaction
    - Thread 2: file reader & decompression
    - Thread 3: resize & recompression
    - Thread 4: network communication
- Can be used in a grid with a pipeline of separable tasks to be performed on each data set
  - Resample / warp volume
  - Segment volume
  - Calculate metrics on segmented volume

Page 52: ISBI MPI Tutorial

Types of parallelism [2/2]

Data parallelism
- Each process handles a portion of the entire data
- Often used with large data sets: [task 0… | … task 1 … | … | … task n]
- Frequently used in MPI programming
- Each process is "doing the same thing," just on a different subset of the whole (see the sketch below)
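A minimal sketch of the data-parallel pattern (illustrative, not from the slides): each rank works on its own contiguous block of a length-N problem, and a single MPI_Reduce combines the partial results; N and the per-element work are arbitrary here.

/* data_parallel.c (illustrative): block decomposition of a 1D problem across ranks. */
#include <stdio.h>
#include <mpi.h>

#define N 1000000                     /* total problem size (arbitrary for this sketch) */

int main (int argc, char * argv[])
{
    int i, rank, nodes, chunk, start, end;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns a contiguous block [start, end) of the index range. */
    chunk = (N + nodes - 1) / nodes;
    start = rank * chunk;
    end   = (start + chunk < N) ? start + chunk : N;

    /* "Doing the same thing" on a different subset: here, summing i*i. */
    for (i = start; i < end; i++)
        local += (double) i * i;

    /* Combine the partial results on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of squares over 0..%d = %.0f\n", N - 1, total);

    MPI_Finalize();
    return 0;
}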

Page 53: ISBI MPI Tutorial

Data layout

[Figure: a 3D volume split into slabs along z, one per node (Node 0 through Node 7); x is the contiguous dimension.]

- Layout is crucial in high-performance computing: BW efficiency; cache efficiency
- Even more important in distributed computing: poor layout means extra communication
- Shown is an example of "block" data distribution
  - x is the contiguous dimension
  - z is the slowest dimension
  - Each node has a contiguous portion of z
(a slab-scatter sketch follows below)
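As a hedged sketch of the block layout above (sizes and names are illustrative, not from the slides), the root can hand each node its contiguous z-slab with a plain MPI_Scatter, precisely because x is contiguous and z is the slowest dimension.

/* slab_scatter.c (illustrative): distribute contiguous z-slabs of a volume. */
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    const int NX = 256, NY = 256, NZ = 64;    /* x contiguous, z slowest (arbitrary sizes) */
    int rank, nodes, zlocal, slabElems;
    float * volume = NULL;
    float * slab;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    zlocal    = NZ / nodes;                   /* assumes NZ divides evenly, for simplicity */
    slabElems = NX * NY * zlocal;
    slab      = malloc(slabElems * sizeof(float));

    if (rank == 0)                            /* only the root holds the full volume */
        volume = calloc((size_t) NX * NY * NZ, sizeof(float));

    /* Each node's z-range is contiguous in memory, so a plain scatter suffices;
       a strided split (e.g., along y) would need a derived datatype instead. */
    MPI_Scatter(volume, slabElems, MPI_FLOAT,
                slab,   slabElems, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* ... each node processes its own slab here ... */

    free(slab);
    free(volume);
    MPI_Finalize();
    return 0;
}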

Page 54: ISBI MPI Tutorial

[Figure (repeated from Page 8): Real-time SENSE unfolding pipeline. Real-time DATA and pre-loaded CAL data feed the processing stages: place each view into the correct x-Ky-Kz space (AP & LP), FTx, FTyz (AP & LP), "traditional" 2D SENSE unfold (AP & LP), homodyne correction, GW correction (Y, Z), GW correction (X), MIP, RESULT, display / DICOM. The diagram marks which stages run on the root node vs. the worker nodes and where MPI communication occurs.]

Page 55: ISBI MPI Tutorial

Separability

Completely separable problems:
- Add 1 to everyone
- Multiply each a[i] * b[i]

Inseparable [?] problems:
- Max of a vector (see the sketch after this list)
- Sort a vector
- MIP of a volume
- 1D FFT of a volume
- 2D FFT of a volume
- 3D FFT of a volume

[Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
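To show how an "inseparable" reduction like the max of a vector is still handled cheaply, here is a hedged sketch (not from the slides): each rank scans its local portion, and a single MPI_Allreduce with MPI_MAX combines the per-rank results; the per-rank data are invented.

/* global_max.c (illustrative): local max per rank, then one small collective. */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i, rank;
    float local[4], localMax, globalMax;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 4; i++)                   /* arbitrary per-rank data for the sketch */
        local[i] = (float) (rank * 10 + i);

    localMax = local[0];                      /* the separable part: a purely local scan */
    for (i = 1; i < 4; i++)
        if (local[i] > localMax)
            localMax = local[i];

    /* The inseparable part shrinks to one reduction over a single float per rank. */
    MPI_Allreduce(&localMax, &globalMax, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global max = %g\n", globalMax);

    MPI_Finalize();
    return 0;
}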

Page 56: ISBI MPI Tutorial

3D-sinc interpolation

Page 57: ISBI MPI Tutorial

Next steps

Dynamic datatypes
- MPI_Type_vector() (see the sketch after this list)
- Enables communication of sub-sets without packing
- Combined with DMA, permits zero-copy transposes, etc.

Other collectives
- MPI_Reduce
- MPI_Scatter
- MPI_Gather

MPI-2 (MPICH2, MVAPICH2)
- One-sided (DMA) communication: MPI_Put(), MPI_Get()
- Dynamic world size: ability to spawn new processes during a run
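To make the MPI_Type_vector() bullet concrete, a hedged sketch (not from the slides): a strided datatype describes one column of a row-major NX x NY array, so the column can be sent directly from the 2D array without packing; NX, NY, and the column index are arbitrary. Run with at least two ranks (mpiexec -np 2 ./column).

/* column.c (illustrative): send a strided column without packing, via MPI_Type_vector. */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    enum { NX = 8, NY = 6 };                  /* row-major: element (x, y) is a[y * NX + x] */
    int i, rank;
    float a[NX * NY], col[NY];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* NY blocks of 1 float, spaced NX floats apart: one column of the array. */
    MPI_Type_vector(NY, 1, NX, MPI_FLOAT, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (i = 0; i < NX * NY; i++)
            a[i] = (float) i;
        /* Send column x = 2 straight out of the 2D array -- no manual packing. */
        MPI_Send(&a[2], 1, column, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive the same data as NY contiguous floats. */
        MPI_Recv(col, NY, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received column: %g ... %g\n", col[0], col[NY - 1]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}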

Page 58: ISBI MPI Tutorial

Topics

Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up

Page 59: ISBI MPI Tutorial

Optimizing MPI code

Take time on the algorithm & data layout
- Minimize traffic between nodes / separate the problem (e.g., FTx into xKyKz in the SENSE example)
- Cache-friendly (linear, efficient) access patterns

Overlap processing and communication (see the sketch after this list)
- MPI_Isend() / MPI_Irecv() with multiple work buffers
- While actively transferring one, process the other
- Larger messages will hit a higher BW (in general)
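A hedged sketch of the overlap pattern above (not from the slides): rank 0 streams chunks to rank 1, which keeps two work buffers and posts the next MPI_Irecv() before processing the chunk it just received, so communication and computation overlap. process_chunk(), CHUNK, and NCHUNKS are placeholders. Run with: mpiexec -np 2 ./overlap.

/* overlap.c (illustrative): double-buffered receive overlapping transfer and work. */
#include <stdio.h>
#include <mpi.h>

#define CHUNK   4096
#define NCHUNKS 16

static double process_chunk(const float * buf, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++) s += buf[i];      /* stand-in for real work */
    return s;
}

int main (int argc, char * argv[])
{
    int rank, c, i;
    float bufA[CHUNK], bufB[CHUNK];
    float * recvBuf = bufA, * workBuf = bufB, * tmp;
    double total = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                           /* producer: send NCHUNKS chunks */
        for (c = 0; c < NCHUNKS; c++) {
            for (i = 0; i < CHUNK; i++) bufA[i] = (float) (c + i);
            MPI_Send(bufA, CHUNK, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {                    /* consumer: double-buffered receive */
        MPI_Irecv(recvBuf, CHUNK, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
        for (c = 0; c < NCHUNKS; c++) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            tmp = workBuf; workBuf = recvBuf; recvBuf = tmp;
            if (c + 1 < NCHUNKS)               /* post the next receive before working */
                MPI_Irecv(recvBuf, CHUNK, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
            total += process_chunk(workBuf, CHUNK);   /* overlaps the in-flight transfer */
        }
        printf("rank 1 processed %d chunks, total = %g\n", NCHUNKS, total);
    }

    MPI_Finalize();
    return 0;
}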

Page 60: ISBI MPI Tutorial

Other MPI / performance thoughts

- Profile: VTune (Intel; Linux / Windows), Shark (Mac), MPI profiling with -mpe=mpilog
- Avoid "premature optimization" (Knuth): weigh implementation time & effort vs. runtime performance
- Use derived datatypes rather than packing
- Using a debugger with MPI is hard: build in your own debugging messages from the start

Page 61: ISBI MPI Tutorial

Conclusion

If you might need MPI, build to MPI.
- Works well in shared memory environments (and it's getting better all the time)
- Encourages memory locality in NUMA architectures (Nehalem, AMD)
- Portable, reusable, open-source
- Can be used in conjunction with threads / OpenMP / TBB / CUDA / OpenCL: the "hybrid model of parallel programming"
- The messaging paradigm can create "less obfuscated" code than threads / OpenMP

Page 62: ISBI MPI Tutorial

Building a cluster

- Homogeneous nodes
- Private network
  - Shared filesystem; ssh communication
  - Password-less SSH
- High-bandwidth private interconnect
  - MPI communication exclusively
  - GbE, 10GbE, Infiniband
- Consider using Rocks
  - CentOS / RHEL based
  - Built for building clusters
  - Rapid network-boot-based install/reinstall of nodes
  - http://www.rocksclusters.org/

Page 63: ISBI MPI Tutorial

References

MPI documents
- http://www.mpi-forum.org/docs/

MPICH2
- http://www.mcs.anl.gov/research/projects/mpich2
- http://lists.mcs.anl.gov/pipermail/mpich-discuss/

OpenMPI
- http://www.open-mpi.org/
- http://www.open-mpi.org/community/lists/ompi.php

MVAPICH[1|2] (Infiniband-tuned distribution)
- http://mvapich.cse.ohio-state.edu/
- http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/

Rocks
- http://www.rocksclusters.org/
- https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/

Books:
- Pacheco, Peter S., Parallel Programming with MPI
- Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
- Gropp, W., Using MPI-2

Page 64: ISBI MPI Tutorial

Questions?

Page 65: ISBI MPI Tutorial

SIMD Example (Transparency painting)

Page 66: ISBI MPI Tutorial

Transparency painting

This is the painting operation for one RGBA pixel (in) onto another (out)

We can do red and blue together, as we know they won’t collide, and we can mask out the unwanted results.

Post-multiply masks are applied in the shifted position to minimize the number of shift operations

Note: we’re using pre-multiplied colors & painting onto an opaque background

#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

inline void blendPreToStatic(const uint32_t & in, uint32_t & out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x00000080u) ++alpha;
    out = A | RGB & (in + ( ( (alpha * (out & RB) & RB_8OFF)
                            | (alpha * (out & G)  & G_8OFF) ) >> 8 ) );
}

Page 67: ISBI MPI Tutorial

Code Detail

Pixel layout (pre-multiplied): [R×A][G×A][B×A][1−A]; compositing: C_out = C′_2 + C′_1 · α_2

OUT = A | RGB & (IN + ( ( (ALPHA * (OUT & RB) & RB_8OFF) | (ALPHA * (OUT & G) & G_8OFF) ) >> 8 ) );

Worked example (in = 0x7F008080, out = 0xFF405060, alpha = 0x7F); the two columns track the R/B word (0x00BB00RR) and the G word (0x0000GG00):

Operation          R/B word      G word
Load (out)         0xFF405060    0xFF405060
Mask               0x00400060    0x00005000
Multiply × alpha   0x1FC02FA0    0x0027B000
Mask               0x1F002F00    0x00270000
OR                 0x1F272F00
SHIFT >> 8         0x001F272F
ADD in             0x7F1FA7AF
Mask (RGB)         0x001FA7AF
OR (A)             0xFF1FA7AF

Page 68: ISBI MPI Tutorial

Vectorizing

When the four output pixels corresponding to four input pixels do not overlap one another, we can use vectorized (SSE2) code

128-bit wide registers; load four 32-bit RGBA values, use the same approach as previously (R|B and G) in two registers to perform four paints at once

Page 69: ISBI MPI Tutorial

Vectorizing

inlinevoidblend4PreToStatic(uint32_t ** in, uint32_t * out) // Paints in (quad-word) onto out{ __m128i rb, g, a, a_, o, mask_reg; // Registers rb = _mm_loadu_si128((__m128i *) out); // Load destination (unaligned -- may not be on a 128-bit boundary) a_ = _mm_load_si128((__m128i *) *in); // We make sure the input is on a 128-bit boundary before this call *in += 4; _mm_prefetch((char*) (*in + 28),_MM_HINT_T0); // Fetch the two-cache-line-out memory mask_reg = _mm_set1_epi32(0x0000FF00); // Set green mask (x4) g = _mm_and_si128(rb,mask_reg); // Mask to greens (x4) mask_reg = _mm_set1_epi32(0x00FF00FF); // Set red and blue mask (x4) rb = _mm_and_si128(rb,mask_reg); // Mask to red and blue rb =_mm_slli_epi32(rb,8); // << 8 ; g is already biased by 256 in 16-bit spacing a = _mm_srli_epi32(a_,24); // >> 24 ; These are the four alpha values, shifted to lower 8 bits of each word mask_reg = _mm_slli_epi32(a,16); // << 16 ; A copy of the four alpha values, shifted to bits [16-23] of each word a = _mm_or_si128(a,mask_reg); // We now have the alpha value at both bits [0-7] and [16-23] of each word // These steps add one to transparancy values >= 80 o = _mm_srli_epi16(a,7); // Now the high bit is the low bit a = _mm_add_epi16(a,o);

Page 70: ISBI MPI Tutorial

    // We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256,
    // and we want to multiply by alpha and then divide by 65536; we achieve this by multiplying
    // the 16-bit values and storing the upper 16 of the 32-bit result. (This is the operation
    // that is available, so that's why we're doing it in this fashion!)
    rb = _mm_mulhi_epu16(rb, a);
    g  = _mm_mulhi_epu16(g, a);
    g  = _mm_slli_epi32(g, 8);                        // Move green into the correct location.
    // R and B, both the lower 8 bits of their 16 bits, don't need to be shifted
    o = _mm_set1_epi32(0xFF000000);                   // Opaque alpha value
    o = _mm_or_si128(o, g);
    o = _mm_or_si128(o, rb);                          // o now has the background's contribution to the output color

    mask_reg = _mm_set1_epi32(0x00FFFFFF);
    g = _mm_and_si128(mask_reg, a_);                  // Removes alpha from the foreground color
    o = _mm_add_epi32(o, g);                          // Add foreground and background contributions together
    _mm_storeu_si128((__m128i *) out, o);             // Unaligned store
}

Vectorizing

Page 71: ISBI MPI Tutorial

Results
- Vectorizing this code achieves a 3-4x speedup on the cluster: 8x 2x (3.4|3.2GHz) Xeon, 800MHz FSB
- Renders a 512x512x409 (400MB) volume in ~22ms (45fps) with the SIMD code vs. ~92ms (11fps) non-vectorized
- ~18GB/s memory throughput
- ~11 cycles/voxel vs. ~45 cycles non-vectorized

Page 72: ISBI MPI Tutorial

MAN PAGES

Page 73: ISBI MPI Tutorial

MPI_Init() MPI_Init(3) MPI MPI_Init(3)

NAME MPI_Init - Initialize the MPI execution environment

SYNOPSIS int MPI_Init( int *argc, char ***argv )

INPUT PARAMETERS
  argc - Pointer to the number of arguments
  argv - Pointer to the argument vector

THREAD AND SIGNAL SAFETY This routine must be called by one thread only. That thread is called the main thread and must be the thread that calls MPI_Finalize .

NOTES The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE . In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.

Page 74: ISBI MPI Tutorial

MPI_Barrier()

MPI_Barrier(3) MPI MPI_Barrier(3)

NAME MPI_Barrier - Blocks until all processes in the communicator have reached this routine.

SYNOPSIS int MPI_Barrier( MPI_Comm comm )

INPUT PARAMETER comm - communicator (handle)

NOTES Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call.

Page 75: ISBI MPI Tutorial

MPI_Finalize()

MPI_Finalize(3) MPI MPI_Finalize(3)

NAME MPI_Finalize - Terminates MPI execution environment

SYNOPSIS int MPI_Finalize( void )

NOTES All processes must call this routine before exiting. The number of processes running after this routine is called is undefined; it is best not to perform much more than a return rc after calling MPI_Finalize .

Page 76: ISBI MPI Tutorial

MPI_Comm_size()

MPI_Comm_size(3) MPI MPI_Comm_size(3)

NAME MPI_Comm_size - Determines the size of the group associated with a communicator

SYNOPSIS int MPI_Comm_size( MPI_Comm comm, int *size )

INPUT PARAMETER comm - communicator (handle)

OUTPUT PARAMETER size - number of processes in the group of comm (integer)

Page 77: ISBI MPI Tutorial

MPI_Comm_rank()

MPI_Comm_rank(3) MPI MPI_Comm_rank(3)

NAME MPI_Comm_rank - Determines the rank of the calling process in the communicator

SYNOPSIS int MPI_Comm_rank( MPI_Comm comm, int *rank )

INPUT ARGUMENT comm - communicator (handle)

OUTPUT ARGUMENT rank - rank of the calling process in the group of comm (integer)

Page 78: ISBI MPI Tutorial

MPI_Send()

MPI_Send(3) MPI MPI_Send(3)

NAME MPI_Send - Performs a blocking send

SYNOPSIS int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

INPUT PARAMETERS
  buf - initial address of send buffer (choice)
  count - number of elements in send buffer (nonnegative integer)
  datatype - datatype of each send buffer element (handle)
  dest - rank of destination (integer)
  tag - message tag (integer)
  comm - communicator (handle)

NOTES This routine may block until the message is received by the destination process.

Page 79: ISBI MPI Tutorial

MPI_Recv() MPI_Recv(3) MPI MPI_Recv(3)

NAME MPI_Recv - Blocking receive for a message

SYNOPSIS int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

OUTPUT PARAMETERS
  buf - initial address of receive buffer (choice)
  status - status object (Status)

INPUT PARAMETERS
  count - maximum number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element (handle)
  source - rank of source (integer)
  tag - message tag (integer)
  comm - communicator (handle)

NOTES The count argument indicates the maximum length of a message; the actual length of the message can be determined with MPI_Get_count .

Page 80: ISBI MPI Tutorial

MPI_Isend()

MPI_Isend(3) MPI MPI_Isend(3)

NAME MPI_Isend - Begins a nonblocking send

SYNOPSIS int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

INPUT PARAMETERS
  buf - initial address of send buffer (choice)
  count - number of elements in send buffer (integer)
  datatype - datatype of each send buffer element (handle)
  dest - rank of destination (integer)
  tag - message tag (integer)
  comm - communicator (handle)

OUTPUT PARAMETER request - communication request (handle)

Page 81: ISBI MPI Tutorial

MPI_Irecv()

MPI_Irecv(3) MPI MPI_Irecv(3)

NAME MPI_Irecv - Begins a nonblocking receive

SYNOPSIS int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

INPUT PARAMETERS
  buf - initial address of receive buffer (choice)
  count - number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element (handle)
  source - rank of source (integer)
  tag - message tag (integer)
  comm - communicator (handle)

OUTPUT PARAMETER request - communication request (handle)

Page 82: ISBI MPI Tutorial

MPI_Bcast()

MPI_Bcast(3) MPI MPI_Bcast(3)

NAME MPI_Bcast - Broadcasts a message from the process with rank "root" to all other processes of the communicator

SYNOPSIS int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )

INPUT/OUTPUT PARAMETER buffer - starting address of buffer (choice)

INPUT PARAMETERS
  count - number of entries in buffer (integer)
  datatype - data type of buffer (handle)
  root - rank of broadcast root (integer)
  comm - communicator (handle)

Page 83: ISBI MPI Tutorial

MPI_Allreduce()

MPI_Allreduce(3) MPI MPI_Allreduce(3)

NAME MPI_Allreduce - Combines values from all processes and distributes the result back to all processes

SYNOPSIS int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

INPUT PARAMETERS
  sendbuf - starting address of send buffer (choice)
  count - number of elements in send buffer (integer)
  datatype - data type of elements of send buffer (handle)
  op - operation (handle)
  comm - communicator (handle)

OUTPUT PARAMETER recvbuf - starting address of receive buffer (choice)

Page 84: ISBI MPI Tutorial

MPI_Type_create_hvector()

MPI_Type_create_hvector(3) MPI MPI_Type_create_hvector(3)

NAME MPI_Type_create_hvector - Create a datatype with a constant stride given in bytes

SYNOPSIS int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

INPUT PARAMETERS
  count - number of blocks (nonnegative integer)
  blocklength - number of elements in each block (nonnegative integer)
  stride - number of bytes between start of each block (address integer)
  oldtype - old datatype (handle)

OUTPUT PARAMETER newtype - new datatype (handle)

Page 85: ISBI MPI Tutorial

mpicc mpicc(1) MPI mpicc(1)

NAME mpicc - Compiles and links MPI programs written in C

DESCRIPTION This command can be used to compile and link MPI programs written in C. It provides the options and any special libraries that are needed to compile and link MPI programs.

It is important to use this command, particularly when linking programs, as it provides the necessary libraries.

COMMAND LINE ARGUMENTS
  -show - Show the commands that would be used without running them
  -help - Give short help
  -cc=name - Use compiler name instead of the default choice. Use this only if the compiler is compatible with the MPICH library (see below)
  -config=name - Load a configuration file for a particular compiler. This allows a single mpicc command to be used with multiple compilers.

[…]

Page 86: ISBI MPI Tutorial

mpiexec mpiexec(1) MPI mpiexec(1)

NAME mpiexec - Run an MPI program

SYNOPSIS mpiexec args executable pgmargs [ : args executable pgmargs ... ]

where args are command line arguments for mpiexec (see below), executable is the name of an executable MPI program, and pgmargs are command line arguments for the executable. Multiple executables can be specified by using the colon notation (for MPMD - Multiple Program Multiple Data applications). For example, the following command will run the MPI program a.out on 4 processes: mpiexec -n 4 a.out

The MPI standard specifies the following arguments and their meanings:

  -n <np> - Specify the number of processes to use
  -host <hostname> - Name of host on which to run processes
  -arch <architecture name> - Pick hosts with this architecture type

[…]