ISBI MPI Tutorial


MPI & Distributed Computing

Eric Borisch, M.S., Mayo Clinic

Topics

- Motivation for distributed computing
- What MPI is
- Intro to MPI programming
- Thinking in parallel
- Wrap up

Shared vs. Distributed Memory

Shared memory: all memory within a system is directly addressable (ignoring access restrictions) by each process [or thread]
- Single- and multi-CPU desktops & laptops
- Multi-threaded apps
- GPGPU *
- MPI *

Distributed memory: memory available to a given node within a system is unique and distinct from its peers
- MPI
- Google MapReduce / Hadoop

Why bother?

[Chart: STREAM benchmark (Copy / Scale / Add / Triad) relative performance vs. number of processes (1-8). CentOS 5.2; Dual Quad-Core 3GHz P4 [E5472]; DDR2 800MHz. http://www.cs.virginia.edu/stream/]

But what about Nehalem?

[Chart: STREAM benchmark OpenMP relative performance (Add / Copy / Scale / Triad) vs. number of threads (0-16, on 8 physical cores + HT). 2x X5570 (2.93GHz; quad-core; 6.4GT/s QPI); 12x4G 1033 DDR3. http://www.cs.virginia.edu/stream/]

Memory Limitations

- Bandwidth (FSB, HT, Nehalem, CUDA, …): frequently run into with high-level languages (MATLAB)
- Capacity: cost & availability; high-density chips are $$$ (if even available); memory limits on individual systems
- Distributed computing addresses both bandwidth and capacity with multiple systems

MPI is the glue used to connect multiple distributed processes together

Memory Requirements [Example]

Custom iterative SENSE reconstruction:
- 3 x 8 coils x 400 x 320 x 176 x 8 [complex float]
  - Profile data (img space)
  - Estimate (img <-> k space)
  - Acquired data (k space)
- > 4GB of data touched during each iteration
- 16- and 32-channel data here or on the way…

Trzasko, Josh. ISBI 2009 #1349, “Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms.” M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro
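For scale, that product works out to 3 x 8 x 400 x 320 x 176 complex-float elements, at 8 bytes each, or roughly 4.3 GB; that is where the “> 4GB touched per iteration” figure comes from.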

[Diagram: real-time SENSE unfolding pipeline. Real-time DATA and pre-loaded CAL data arrive at the root node; the worker nodes perform FTx, placement of each view into the correct x-Ky-Kz space (AP & LP), FTyz (AP & LP), “traditional” 2D SENSE unfold (AP & LP), homodyne correction, GW correction (Y, Z), GW correction (X), and MIP; the RESULT returns to the root node for store / DICOM. The links between root and worker nodes are MPI communication.]

MRI Reconstruction Cluster

[Diagram: root node (2x 3.6GHz P4, 16GB RAM, 80GB HDD, 1Gb Ethernet, 2x8Gb Infiniband) plus 7 worker nodes (each 2x 3.6GHz P4, 16GB RAM, 80GB HDD, 2x8Gb Infiniband), joined by a 24-port Infiniband switch (x2 MPI interconnects per node; 16Gb/s bandwidth per node) and a 16-port Gigabit Ethernet switch (x7 file-system connections; 500GB HDD). 1Gb Ethernet links connect to the MRI system and the site intranet; the key distinguishes cluster hardware from external hardware and 2x8Gig Infiniband from 1Gig Ethernet connections.]

Many Approaches to “Distributed”

- Loosely coupled: SETI / BOINC (“grid computing”)
- BIOS-level abstraction: ScaleMP
- Tightly coupled: MPI (“cluster computing”)
- Hybrid: Folding@Home, gpugrid.net


Grid vs. Cluster

[Diagram: a grid, with a master handing work to loosely connected workers, vs. a cluster, with a head node and worker nodes joined by a dedicated interconnect.]

Shared vs. Distributed

[Diagram: in the shared-memory case, one host and OS run processes A and B, and threads 1..N within a process communicate through memory transfers. In the distributed case, hosts I, II, …, N each run their own OS and their own processes (A, B, C, and possibly several more per host: D, E, F), which communicate through network transfers.]

Topics

- Motivation for distributed computing
- What MPI is
- Intro to MPI programming
- Thinking in parallel
- Wrap up

MPI

Message Passing Interface is…
- “a library specification for message-passing” [1]
- Available in many implementations on multiple platforms *
- A set of functions for moving messages between different processes without a shared memory environment
- Low-level *; no concept of the overall computing task to be performed

[1] http://www.mcs.anl.gov/research/projects/mpi/

MPI history

MPI-1
- Version 1.0 draft standard in 1994
- Version 1.1 in 1995
- Version 1.2 in 1997
- Version 1.3 in 2008

MPI-2
- Added: 1-sided communication; dynamic “world” sizes (spawn / join)
- Version 2.0 in 1997
- Version 2.1 in 2008

MPI-3
- In process; enhanced fault handling

Forward compatibility preserved

MPI Status

MPI is the de-facto standard for distributed computing: freely available, open-source implementations exist, portable, mature.

From a discussion of why MPI is dominant [1]:

    […] 100s of languages have come and gone. Good stuff must have been created [… yet] it is broadly accepted in the field that they’re not used. MPI has a lock. OpenMP is accepted, but a distant second. There are substantial barriers to the introduction of new languages and language constructs. Economic, ecosystem related, psychological, a catch-22 of widespread use, etc. Any parallel language proposal must come equipped with reasons why it will overcome those barriers.

[1] http://www.ieeetcsc.org/newsletters/2006-01/why_all_mpi_discussion.html

MPI Distributions

MPI itself is just a specification; we want an implementation:
- MPICH, MPICH2: widely portable
- MVAPICH, MVAPICH2: Infiniband-centric; MPICH/MPICH2 based
- OpenMPI: plug-in architecture; many run-time options
- And more: IntelMPI, HP-MPI, MPI for IBM Blue Gene, MPI for Cray, Microsoft MPI, MPI for SiCortex, MPI for Myrinet Express (MX), MPICH2 over SCTP

Implementing a distributed system

Without MPI:
- Start all of the processes across a bank of machines (shell scripting + ssh)
- socket(), bind(), listen(), accept() or connect() for each link
- send(), read() on individual links
- Raw byte interfaces; no discrete messages

Implementing a distributed system

With MPI:
- mpiexec -np <n> app
- MPI_Init()
- MPI_Send() / MPI_Recv()
- MPI_Finalize()

MPI manages the connections, packages the messages, and provides the launching mechanism.

MPI (the document) [1]

Provides definitions for:
- Communication functions: MPI_Send(), MPI_Recv(), MPI_Bcast(), etc.
- Datatype management functions: MPI_Type_create_hvector(), etc.
- C, C++, and Fortran bindings

Also recommends a process startup mechanism:
    mpiexec -np <nproc> <program> <args>

[1] http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html

MPI Functions

[The slide lists the complete set of MPI-2 function names, from MPI_Abort, MPI_Accumulate, MPI_Add_error_class, … through … MPI_Win_wait, MPI_Wtick, MPI_Wtime: a few hundred calls, shown to illustrate the breadth of the API.]

The message passing mindset

- Each process owns its data; there is no “our” data
  - Makes many things simpler: no mutexes, condition variables, semaphores, etc.; memory-access-order race conditions go away
- Every message is an explicit copy
  - I keep the memory I sent from; you have the memory you received into
  - True even when running in a “shared memory” environment
- Synchronization comes along for free
  - I won’t get your message (or data) until you choose to send it
- Programming to MPI first can make it easier to scale out later

Topics

- Motivation for distributed computing
- What MPI is
- Intro to MPI programming
- Thinking in parallel
- Wrap up

Getting started with MPI

- Download / decompress the MPICH2 source: http://www.mcs.anl.gov/research/projects/mpich2/
  - Supports C / C++ / Fortran
  - Requires Python >= 2.2
- ./configure, make, make install
  - Installs into /usr/local by default, or use --prefix=<chosen path>
- Make sure <prefix>/bin is in PATH
- Make sure <prefix>/share/man is in MANPATH

MPI Installation

[Figure: the installed tools under <prefix>/bin: mpicc (C compiler wrapper), mpicxx (C++ compiler wrapper), mpiexec (MPI job launcher), and mpdboot (MPD launcher).]

MPD launch

- Set up passwordless ssh to the workers
- Start the daemons with mpdboot -n <N>
  - Requires ~/.mpd.conf to exist on each host
    - Contains (the same on each host): MPD_SECRETWORD=<some gibberish string>
    - Permissions set to 600 (r/w access for the owner only)
  - Requires ./mpd.hosts to list the other host names
    - Unless run as mpdboot -n 1 (run on the current host only)
    - Will not accept the current host in the list (it is implicit)
- Check for running daemons with mpdtrace

For details: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
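As a concrete sketch (the node names are hypothetical): ~/.mpd.conf on every host contains the single line

    MPD_SECRETWORD=some-gibberish-string

and has its permissions set to 600, while ./mpd.hosts on the host running mpdboot lists the other machines, one per line:

    node01
    node02
    node03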


MPI Compile & launch

- Use mpicc / mpicxx as the C / C++ compiler
  - Wrapper scripts around the C / C++ compilers detected during install

    $ mpicc --show
    gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt

    $ mpicc -o hello hello.c

- Use mpiexec -np <nproc> <app> <args> to launch

    $ mpiexec -np 4 ./hello
    Hello, Hello, Hello, world world world

/* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i, rank, nodes;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < nodes; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}

$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!

Threads vs. MPI startup

[Diagram 1: ./threaded_app runs a single process; main() calls pthread_create(func()), the threads do work against the shared process memory, then pthread_join() / pthread_exit() before exit().]

[Diagram 2: mpiexec -np 4 ./mpi_app has mpd launch four independent mpi_app processes (ranks 0-3), each running its own main() and doing work on its own local memory. MPI communication happens inside MPI_Init(), MPI_Bcast(), MPI_Allreduce(), and MPI_Finalize(); each rank then calls exit().]

Hello, world: unique to ranks

/* hello.c */
#include <stdio.h>
#include <mpi.h>

int
main (int argc, char * argv[])
{
    int i;
    int rank;
    int nodes;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < nodes; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}

MPE (Multi-Process Environment)

- MPICH2 comes with MPE by default (unless disabled during configure)
- Multiple tracing / logging options to track MPI traffic
- Enabled through -mpe=<option> at compile time

MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
MacPro:code$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
Writing logfile....
Enabling the Default clock synchronization...
Finished writing logfile ./hello.clog2.

jumpshot view of log

Output with -mpe=mpitrace

MacPro:code$ mpicc -mpe=mpitrace -o hello hello.c
MacPro:code$ mpiexec -np 2 ./hello > trace

MacPro:code$ grep 0 trace
[0] Ending MPI_Init
[0] Starting MPI_Comm_size...
[0] Ending MPI_Comm_size
[0] Starting MPI_Comm_rank...
[0] Ending MPI_Comm_rank
[0] Starting MPI_Barrier...
[0] Ending MPI_Barrier
Hello from 0 of 2!
[0] Starting MPI_Barrier...
[0] Ending MPI_Barrier
[0] Starting MPI_Finalize...
[0] Ending MPI_Finalize

MacPro:code$ grep 1 trace
[1] Ending MPI_Init
[1] Starting MPI_Comm_size...
[1] Ending MPI_Comm_size
[1] Starting MPI_Comm_rank...
[1] Ending MPI_Comm_rank
[1] Starting MPI_Barrier...
[1] Ending MPI_Barrier
[1] Starting MPI_Barrier...
[1] Ending MPI_Barrier
Hello from 1 of 2!
[1] Starting MPI_Finalize...
[1] Ending MPI_Finalize

A more interesting log…

3D-sinc interpolation

MPI_Send (Blocking)

int MPI_Send(
    void *buf,             // memory location to send from
    int count,             // number of elements (of type datatype) at buf
    MPI_Datatype datatype, // MPI_INT, MPI_FLOAT, etc., or custom datatypes: strided vectors, structures, etc.
    int dest,              // rank (within the communicator comm) of the destination for this message
    int tag,               // used to distinguish this message from other messages
    MPI_Comm comm )        // communicator for this transfer; often MPI_COMM_WORLD

MPI_Recv (Blocking)

int MPI_Recv(
    void *buf,             // memory location to receive data into
    int count,             // number of elements (of type datatype) available to receive into at buf
    MPI_Datatype datatype, // MPI_INT, MPI_FLOAT, etc., or custom datatypes; typically matches the sending datatype, but doesn’t have to
    int source,            // rank (within the communicator comm) of the source for this message; can also be MPI_ANY_SOURCE
    int tag,               // used to distinguish this message from other messages; can also be MPI_ANY_TAG
    MPI_Comm comm,         // communicator for this transfer; often MPI_COMM_WORLD
    MPI_Status *status )   // structure describing the received message, including:
                           //   the actual count (can be smaller than the passed count)
                           //   the source (useful when source = MPI_ANY_SOURCE)
                           //   the tag (useful when tag = MPI_ANY_TAG)
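As a hedged sketch (not from the slides) of how the status fields and MPI_Get_count() are typically used, the following small program has rank 1 send a short message that rank 0 receives with MPI_ANY_SOURCE / MPI_ANY_TAG; run it with mpiexec -np 2:

/* getcount.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, nodes, nrecv;
    int sendbuf[4] = {10, 20, 30, 40};
    int recvbuf[256];                       /* receive capacity > actual count */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)
        MPI_Send(sendbuf, 4, MPI_INT, 0, 7, MPI_COMM_WORLD);

    if (rank == 0)
    {
        MPI_Recv(recvbuf, 256, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &nrecv);  /* how many actually arrived */
        printf("got %i ints from rank %i (tag %i)\n",
               nrecv, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}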

Another example

/* sr.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int
main (int argc, char * argv[])
{
    int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
    MPI_Status sendStatus;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    myData[0] = rank;
    MPI_Send(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
    MPI_Recv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0, MPI_COMM_WORLD, &sendStatus);

    printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

    MPI_Finalize();
    return 0;
}

Does it run?

$ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

Log output (-np 4)

May != Will

$ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 2 ./sr
^C

$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0

What the standard has to say… (3.4 Communication Modes)

The send call described in Section Blocking send is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.

Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol.

The send call described in Section Blocking send used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.

Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.

http://www.mpi-forum.org/docs/mpi-11-html/node40.html#Node40

Rendezvous vs. eager (simplified)

[Diagram: for a “small” message, the sender’s send is eager: it returns as soon as the data is handed off, and the receiver later requests and receives the small message. For a “large” message, the sender issues a rendezvous request and blocks until completion; the receiver must receive and match the rendezvous request before the rendezvous send delivers the data and the receiver obtains the large message. User activity and MPI activity are shown separately.]

MPI communication modes

- MPI_Bsend (buffered) (MPI_Ibsend, MPI_Bsend_init)
  - Sends are “local”: they return independent of any remote activity
  - The message buffer can be touched immediately after the call returns
  - Requires a user-provided buffer, supplied via MPI_Buffer_attach() (see the sketch below)
  - Forces an “eager”-like message transfer from the sender’s perspective
  - The user can wait for completion by calling MPI_Buffer_detach()
- MPI_Ssend (synchronous) (MPI_Issend, MPI_Ssend_init)
  - Won’t return until the matching receive is posted
  - Forces a “rendezvous”-like message transfer
  - Can be used to guarantee synchronization without additional MPI_Barrier() calls
- MPI_Rsend (ready) (MPI_Irsend, MPI_Rsend_init)
  - Erroneous if the matching receive has not been posted
  - A performance tweak (on some systems) when the user can guarantee the matching receive is posted
- MPI_Isend, MPI_Irecv (immediate) (MPI_Send_init, MPI_Recv_init)
  - Non-blocking; immediate return once the send/receive request is posted
  - Requires an MPI_[Test|Wait][|all|any|some] call to guarantee completion
  - Send/receive buffers should not be touched until completion
  - An MPI_Request * argument is used for eventual completion

The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to receive any send mode.
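A minimal buffered-send sketch (not from the slides; the message and buffer sizes are arbitrary), passing one int around the ring as in sr.c but with MPI_Bsend:

/* bsend.c */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int rank, nodes, msg, bufsize;
    void * buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Attach enough space for one int message plus MPI's bookkeeping. */
    bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);

    msg = rank;
    /* Returns as soon as the message is copied into the attached buffer. */
    MPI_Bsend(&msg, 1, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
    MPI_Recv(&msg, 1, MPI_INT, (rank + nodes - 1) % nodes, 0,
             MPI_COMM_WORLD, &status);
    printf("%i received %i\n", rank, msg);

    /* Detach blocks until all buffered messages have been delivered. */
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);

    MPI_Finalize();
    return 0;
}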

Fixing the code

/* sr2.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int
main (int argc, char * argv[])
{
    int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
    MPI_Status xferStatus[2];
    MPI_Request xferRequest[2];

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    myData[0] = rank;
    MPI_Isend(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD, &xferRequest[0]);
    MPI_Irecv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0, MPI_COMM_WORLD, &xferRequest[1]);

    MPI_Waitall(2, xferRequest, xferStatus);

    printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

    MPI_Finalize();
    return 0;
}

Fixed with MPI_I[send|recv]()

$ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 4 ./sr2
0 sent 0; received 3
2 sent 2; received 1
1 sent 1; received 0
3 sent 3; received 2

Topics

- Motivation for distributed computing
- What MPI is
- Intro to MPI programming
- Thinking in parallel
- Wrap up

Types of parallelism [1/2]

Task parallelism
- Each process handles a unique kind of task
  - Example: multi-image uploader (with resize/recompress)
    - Thread 1: GUI / user interaction
    - Thread 2: file reading & decompression
    - Thread 3: resize & recompression
    - Thread 4: network communication
- Can be used in a grid with a pipeline of separable tasks to be performed on each data set
  - Resample / warp volume
  - Segment volume
  - Calculate metrics on the segmented volume

Types of parallelism [2/2]

Data parallelism
- Each process handles a portion of the entire data
- Often used with large data sets
  - [task 0 … | … task 1 … | … | … task n]
- Frequently used in MPI programming
- Each process is “doing the same thing,” just on a different subset of the whole

Data layout

[Figure: a 3D volume (x, y, z) divided along z into contiguous blocks assigned to Node 0 through Node 7.]

- Layout is crucial in high-performance computing: bandwidth efficiency, cache efficiency
- It is even more important in distributed computing: a poor layout means extra communication
- Shown is an example of “block” data distribution (see the sketch below)
  - x is the contiguous dimension; z is the slowest dimension
  - Each node has a contiguous portion of z
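A hedged sketch (not from the slides) of how each rank can compute its own contiguous z-range for such a block distribution; nz is a hypothetical total z extent, and rank / nodes come from MPI_Comm_rank / MPI_Comm_size:

    /* Block distribution of nz slices across 'nodes' ranks: each rank owns
       the contiguous range [z0, z0 + my_nz) in the slowest (z) dimension.  */
    int nz    = 176;                           /* hypothetical total z extent */
    int base  = nz / nodes;                    /* slices every rank gets      */
    int rem   = nz % nodes;                    /* extras go to the low ranks  */
    int my_nz = base + (rank < rem ? 1 : 0);
    int z0    = rank * base + (rank < rem ? rank : rem);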

[Diagram (repeated from earlier): real-time SENSE unfolding pipeline. Real-time DATA and pre-loaded CAL data arrive at the root node; the worker nodes perform FTx, placement of each view into the correct x-Ky-Kz space (AP & LP), FTyz (AP & LP), “traditional” 2D SENSE unfold (AP & LP), homodyne correction, GW correction (Y, Z), GW correction (X), and MIP; the RESULT returns to the root node for display / DICOM. The links between root and worker nodes are MPI communication.]

Separability

Completely separable problems:
- Add 1 to everyone
- Multiply each a[i] * b[i]

Inseparable problems: [?]
- Max of a vector
- Sort a vector
- MIP of a volume
- 1D FFT of a volume
- 2D FFT of a volume
- 3D FFT of a volume

[Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
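The “max of a vector” case, for instance, is only mildly inseparable: each rank reduces its own block and a single collective combines the partial results. A hedged sketch (assumes each rank holds its portion in local[0..n-1]):

    /* Local reduction followed by a global MPI_MAX reduction; every rank
       ends up holding the global maximum in global_max.                 */
    float local_max = local[0], global_max;
    int   i;
    for (i = 1; i < n; i++)
        if (local[i] > local_max) local_max = local[i];

    MPI_Allreduce(&local_max, &global_max, 1, MPI_FLOAT, MPI_MAX,
                  MPI_COMM_WORLD);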

3D-sinc interpolation

Next steps

- Dynamic datatypes
  - MPI_Type_vector(): enables communication of sub-sets without packing
  - Combined with DMA, permits zero-copy transposes, etc.
- Other collectives: MPI_Reduce, MPI_Scatter, MPI_Gather
- MPI-2 (MPICH2, MVAPICH2)
  - One-sided (DMA) communication: MPI_Put(), MPI_Get()
  - Dynamic world size: the ability to spawn new processes during a run
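As a hedged illustration (not from the slides) of MPI_Type_vector(), the sketch below describes one column of a row-major nx-by-ny float slab as a strided datatype so it can be sent without packing; slab, nx, and ny are hypothetical:

    /* ny blocks of 1 float each, separated by a stride of nx floats:
       one column of a row-major nx x ny slab.                          */
    MPI_Datatype column;
    MPI_Type_vector(ny, 1, nx, MPI_FLOAT, &column);
    MPI_Type_commit(&column);

    /* Send column 3 of 'slab' to rank 1 as a single message.           */
    MPI_Send(&slab[3], 1, column, 1, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column);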

Topics

- Motivation for distributed computing
- What MPI is
- Intro to MPI programming
- Thinking in parallel
- Wrap up

Optimizing MPI code

- Take time on the algorithm & data layout
  - Minimize traffic between nodes / separate the problem (FTx into xKyKz in the SENSE example)
  - Cache-friendly (linear, efficient) access patterns
- Overlap processing and communication (see the sketch below)
  - MPI_Isend() / MPI_Irecv() with multiple work buffers
  - While actively transferring one, process the other
  - Larger messages will hit a higher bandwidth (in general)
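A hedged double-buffering sketch (not from the slides): while the next block is arriving via MPI_Irecv(), the previously received block is processed. BLOCK, nblocks, src, and process_block() are hypothetical:

    /* Two receive buffers: post a receive into one while processing the other. */
    float bufA[BLOCK], bufB[BLOCK];
    float * recving = bufA, * working = bufB, * tmp;
    MPI_Request req;
    int b;

    MPI_Irecv(recving, BLOCK, MPI_FLOAT, src, 0, MPI_COMM_WORLD, &req);
    for (b = 0; b < nblocks; b++)
    {
        MPI_Wait(&req, MPI_STATUS_IGNORE);                 /* block b has arrived     */
        tmp = working; working = recving; recving = tmp;   /* swap buffers            */
        if (b + 1 < nblocks)                               /* prefetch the next block */
            MPI_Irecv(recving, BLOCK, MPI_FLOAT, src, 0, MPI_COMM_WORLD, &req);
        process_block(working);                            /* compute overlaps with transfer */
    }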

Other MPI / performance thoughts

- Profile: VTune (Intel; Linux / Windows), Shark (Mac), MPI profiling with -mpe=mpilog
- Avoid “premature optimization” (Knuth): weigh implementation time & effort against runtime performance
- Use derived datatypes rather than packing
- Using a debugger with MPI is hard: build in your own debugging messages from the start

Conclusion

- If you might need MPI, build to MPI
  - Works well in shared memory environments, and it’s getting better all the time
  - Encourages memory locality on NUMA architectures (Nehalem, AMD)
- Portable, reusable, open source
- Can be used in conjunction with threads / OpenMP / TBB / CUDA / OpenCL: the “hybrid model of parallel programming”
- The messaging paradigm can create “less obfuscated” code than threads / OpenMP

Building a cluster

- Homogeneous nodes
- Private network
  - Shared filesystem; ssh communication
  - Password-less SSH
- High-bandwidth private interconnect
  - MPI communication exclusively
  - GbE, 10GbE, Infiniband
- Consider using Rocks
  - CentOS / RHEL based
  - Built for building clusters
  - Rapid network-boot-based install/reinstall of nodes
  - http://www.rocksclusters.org/

References

- MPI documents: http://www.mpi-forum.org/docs/
- MPICH2: http://www.mcs.anl.gov/research/projects/mpich2 and http://lists.mcs.anl.gov/pipermail/mpich-discuss/
- OpenMPI: http://www.open-mpi.org/ and http://www.open-mpi.org/community/lists/ompi.php
- MVAPICH[1|2] (Infiniband-tuned distributions): http://mvapich.cse.ohio-state.edu/ and http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/
- Rocks: http://www.rocksclusters.org/ and https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/
- Books:
  - Pacheco, Peter S., Parallel Programming with MPI
  - Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
  - Gropp, W., Using MPI-2

Questions?

SIMD Example (Transparency painting)

Transparency painting

This is the painting operation for one RGBA pixel (in) onto another (out)

We can do red and blue together, as we know they won’t collide, and we can mask out the unwanted results.

Post-multiply masks are applied in the shifted position to minimize the number of shift operations

Note: we’re using pre-multiplied colors & painting onto an opaque background

#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

inline void blendPreToStatic(const uint32_t & in, uint32_t & out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x00000080u) ++alpha;
    out = A | RGB & (in + ( ( (alpha * (out & RB) & RB_8OFF)
                            | (alpha * (out & G)  & G_8OFF) ) >> 8 ) );
}

[Worked example: step-by-step register contents (load, mask, multiply, mask, OR, shift, add, mask, OR) for painting one pre-multiplied RGBA input pixel onto an opaque output pixel, with the red/blue channels (0x00BB00RR positions) and the green channel (0x0000GG00 position) tracked separately.]

Code Detail

Pixel layout: [R×A][G×A][B×A][1−A];  C_out = C′2 + C′1 · α2

OUT = A | RGB & (IN + ( ( (ALPHA * (OUT & RB) & RB_8OFF) | (ALPHA * (OUT & G) & G_8OFF) ) >> 8 ) );

Vectorizing

For cases where the four output pixels corresponding to four input pixels do not overlap, we can use vectorized (SSE2) code.

With 128-bit wide registers, we load four 32-bit RGBA values and use the same approach as before (R|B and G split across two registers) to perform four paints at once.

Vectorizing

inline void blend4PreToStatic(uint32_t ** in, uint32_t * out)   // Paints in (quad-word) onto out
{
    __m128i rb, g, a, a_, o, mask_reg;              // Registers
    rb = _mm_loadu_si128((__m128i *) out);          // Load destination (unaligned -- may not be on a 128-bit boundary)
    a_ = _mm_load_si128((__m128i *) *in);           // We make sure the input is on a 128-bit boundary before this call
    *in += 4;
    _mm_prefetch((char*) (*in + 28), _MM_HINT_T0);  // Fetch the two-cache-line-out memory
    mask_reg = _mm_set1_epi32(0x0000FF00);          // Set green mask (x4)
    g = _mm_and_si128(rb, mask_reg);                // Mask to greens (x4)
    mask_reg = _mm_set1_epi32(0x00FF00FF);          // Set red and blue mask (x4)
    rb = _mm_and_si128(rb, mask_reg);               // Mask to red and blue
    rb = _mm_slli_epi32(rb, 8);                     // << 8 ; g is already biased by 256 in 16-bit spacing
    a = _mm_srli_epi32(a_, 24);                     // >> 24 ; the four alpha values, shifted to the lower 8 bits of each word
    mask_reg = _mm_slli_epi32(a, 16);               // << 16 ; a copy of the four alpha values, shifted to bits [16-23] of each word
    a = _mm_or_si128(a, mask_reg);                  // We now have the alpha value at both bits [0-7] and [16-23] of each word
    // These steps add one to transparency values >= 0x80
    o = _mm_srli_epi16(a, 7);                       // Now the high bit is the low bit
    a = _mm_add_epi16(a, o);

    // We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
    // to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
    // storing the upper 16 of the 32-bit result. (This is the operation that is available, so that's why we're
    // doing it in this fashion!)
    rb = _mm_mulhi_epu16(rb, a);
    g = _mm_mulhi_epu16(g, a);
    g = _mm_slli_epi32(g, 8);                       // Move green into the correct location.
    // R and B, both the lower 8 bits of their 16 bits, don't need to be shifted
    o = _mm_set1_epi32(0xFF000000);                 // Opaque alpha value
    o = _mm_or_si128(o, g);
    o = _mm_or_si128(o, rb);                        // o now has the background's contribution to the output color

    mask_reg = _mm_set1_epi32(0x00FFFFFF);
    g = _mm_and_si128(mask_reg, a_);                // Removes alpha from the foreground color
    o = _mm_add_epi32(o, g);                        // Add the foreground and background contributions together
    _mm_storeu_si128((__m128i *) out, o);           // Unaligned store
}

Vectorizing

- Vectorizing this code achieves a 3-4x speedup on the cluster (8 nodes, each 2x 3.4|3.2GHz Xeon, 800MHz FSB)
- Renders a 512x512x409 (400MB) volume in
  - ~22ms (45fps) with the SIMD code
  - ~92ms (11fps) non-vectorized
- ~18GB/s memory throughput
- ~11 cycles / voxel vs. ~45 cycles non-vectorized

Results

MAN PAGES

MPI_Init()

NAME MPI_Init - Initialize the MPI execution environment

SYNOPSIS int MPI_Init( int *argc, char ***argv )

INPUT PARAMETERS argc - Pointer to the number of arguments argv - Pointer to the argument vector

THREAD AND SIGNAL SAFETY This routine must be called by one thread only. That thread is called the main thread and must be the thread that calls MPI_Finalize .

NOTES The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE . In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.

MPI_Barrier()


NAME MPI_Barrier - Blocks until all processes in the communicator have reached this routine.

SYNOPSIS int MPI_Barrier( MPI_Comm comm )

INPUT PARAMETER comm - communicator (handle)

NOTES Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call.

MPI_Finalize()


NAME MPI_Finalize - Terminates MPI execution environment

SYNOPSIS int MPI_Finalize( void )

NOTES All processes must call this routine before exiting. The number of processes running after this routine is called is undefined; it is best not to perform much more than a return rc after calling MPI_Finalize .

MPI_Comm_size()


NAME MPI_Comm_size - Determines the size of the group associated with a communicator

SYNOPSIS int MPI_Comm_size( MPI_Comm comm, int *size )

INPUT PARAMETER comm - communicator (handle)

OUTPUT PARAMETER size - number of processes in the group of comm (integer)

MPI_Comm_rank()


NAME MPI_Comm_rank - Determines the rank of the calling process in the communicator

SYNOPSIS int MPI_Comm_rank( MPI_Comm comm, int *rank )

INPUT ARGUMENT comm - communicator (handle)

OUTPUT ARGUMENT rank - rank of the calling process in the group of comm (integer)

MPI_Send()


NAME MPI_Send - Performs a blocking send

SYNOPSIS int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

INPUT PARAMETERS
    buf - initial address of send buffer (choice)
    count - number of elements in send buffer (nonnegative integer)
    datatype - datatype of each send buffer element (handle)
    dest - rank of destination (integer)
    tag - message tag (integer)
    comm - communicator (handle)

NOTES This routine may block until the message is received by the destination process.

MPI_Recv()

NAME MPI_Recv - Blocking receive for a message

SYNOPSIS int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

OUTPUT PARAMETERS buf - initial address of receive buffer (choice) status - status object (Status)

INPUT PARAMETERS
    count - maximum number of elements in receive buffer (integer)
    datatype - datatype of each receive buffer element (handle)
    source - rank of source (integer)
    tag - message tag (integer)
    comm - communicator (handle)

NOTES The count argument indicates the maximum length of a message; the actual length of the message can be determined with MPI_Get_count .

MPI_Isend()


NAME MPI_Isend - Begins a nonblocking send

SYNOPSIS int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

INPUT PARAMETERS
    buf - initial address of send buffer (choice)
    count - number of elements in send buffer (integer)
    datatype - datatype of each send buffer element (handle)
    dest - rank of destination (integer)
    tag - message tag (integer)
    comm - communicator (handle)

OUTPUT PARAMETER request - communication request (handle)

MPI_Irecv()


NAME MPI_Irecv - Begins a nonblocking receive

SYNOPSIS int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

INPUT PARAMETERS
    buf - initial address of receive buffer (choice)
    count - number of elements in receive buffer (integer)
    datatype - datatype of each receive buffer element (handle)
    source - rank of source (integer)
    tag - message tag (integer)
    comm - communicator (handle)

OUTPUT PARAMETER request - communication request (handle)

MPI_Bcast()


NAME MPI_Bcast - Broadcasts a message from the process with rank "root" to all other processes of the communicator

SYNOPSIS int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )

INPUT/OUTPUT PARAMETER buffer - starting address of buffer (choice)

INPUT PARAMETERS
    count - number of entries in buffer (integer)
    datatype - data type of buffer (handle)
    root - rank of broadcast root (integer)
    comm - communicator (handle)

MPI_Allreduce()


NAME MPI_Allreduce - Combines values from all processes and distributes the result back to all processes

SYNOPSIS int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

INPUT PARAMETERS
    sendbuf - starting address of send buffer (choice)
    count - number of elements in send buffer (integer)
    datatype - data type of elements of send buffer (handle)
    op - operation (handle)
    comm - communicator (handle)

OUTPUT PARAMETER recvbuf - starting address of receive buffer (choice)
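A hedged usage sketch (not part of the man page): every rank contributes a partial sum and every rank receives the global total; local_sum() is a hypothetical per-rank computation.

    double partial = local_sum();   /* hypothetical per-rank partial result */
    double total;

    /* Combine with MPI_SUM; every rank receives the same global total.     */
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);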

MPI_Type_create_hvector()


NAME MPI_Type_create_hvector - Create a datatype with a constant stride given in bytes

SYNOPSIS int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

INPUT PARAMETERS
    count - number of blocks (nonnegative integer)
    blocklength - number of elements in each block (nonnegative integer)
    stride - number of bytes between start of each block (address integer)
    oldtype - old datatype (handle)

OUTPUT PARAMETER newtype - new datatype (handle)

mpicc

NAME mpicc - Compiles and links MPI programs written in C

DESCRIPTION This command can be used to compile and link MPI programs written in C. It provides the options and any special libraries that are needed to compile and link MPI programs.

It is important to use this command, particularly when linking programs, as it provides the necessary libraries.

COMMAND LINE ARGUMENTS
    -show - Show the commands that would be used without running them
    -help - Give short help
    -cc=name - Use compiler name instead of the default choice. Use this only if the compiler is compatible with the MPICH library (see below)
    -config=name - Load a configuration file for a particular compiler. This allows a single mpicc command to be used with multiple compilers.

[…]

mpiexec

NAME mpiexec - Run an MPI program

SYNOPSIS mpiexec args executable pgmargs [ : args executable pgmargs ... ]

where args are command line arguments for mpiexec (see below), executable is the name of an executable MPI program, and pgmargs are command line arguments for the executable. Multiple executables can be specified by using the colon notation (for MPMD - Multiple Program Multiple Data applications). For example, the following command will run the MPI program a.out on 4 processes: mpiexec -n 4 a.out

The MPI standard specifies the following arguments and their meanings:

-n <np> - Specify the number of processes to use
    -host <hostname> - Name of host on which to run processes
    -arch <architecture name> - Pick hosts with this architecture type

[…]
