Programming a Heterogeneous Computing Cluster
PRESENTED BY AASHRITH H. GOVINDRAJ
We’ll discuss the following today
• Background of Heterogeneous Computing
• Message Passing Interface (MPI)
• Vector Addition Example (MPI Implementation)
• More implementation details of MPI
Background
• Heterogeneous Computing System (HCS)
• High Performance Computing & its uses
• Supercomputer vs. HCS
• Why use Heterogeneous Computers in HCS?
• MPI is the predominant message passing system for clusters
Introduction to MPI
• MPI stands for Message Passing Interface
• Predominant API
• Runs on virtually any hardware platform
• Programming model – distributed memory model
• Supports explicit parallelism
• Multiple languages supported
Reasons for using MPI
• Standardization
• Portability
• Performance opportunities
• Functionality
• Availability
MPI Model
• Flat view of the cluster to the programmer
• SPMD programming model
• No global memory
• Inter-process communication is possible & required
• Process synchronization primitives
MPI Program Structure
• Required header file
  • C – mpi.h
  • Fortran – mpif.h
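A minimal C skeleton with this structure might look like the following (a sketch; the work between the initialize and finalize calls is a placeholder):

```c
#include <mpi.h>    /* the required MPI header for C */
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);   /* initialize the MPI environment; must precede other MPI calls */

    /* ... the program's serial and parallel work goes here ... */
    printf("Hello from an MPI process\n");

    MPI_Finalize();           /* terminate the MPI environment; no MPI calls after this */
    return 0;
}
```

Compile with an MPI compiler wrapper (e.g. `mpicc`) and launch with `mpirun -np 4 ./a.out`, assuming an MPI implementation such as Open MPI or MPICH is installed.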
MPI Thread Support
• Level 0 – MPI_THREAD_SINGLE: only one thread in the process
• Level 1 – MPI_THREAD_FUNNELED: multithreaded, but only the main thread makes MPI calls
• Level 2 – MPI_THREAD_SERIALIZED: multithreaded, but only one thread makes MPI calls at a time
• Level 3 – MPI_THREAD_MULTIPLE: any thread may make MPI calls at any time
Format of MPI Calls
• Case sensitivity
  • C – yes
  • Fortran – no
• Name restrictions
  • MPI_* and PMPI_* (profiling interface) are reserved
• Error handling
  • Handled via the return parameter
Groups & Communicators
• Groups – ordered set of processes
• Communicators – handle to a group of processes
• Most MPI routines require a communicator as an argument
• MPI_COMM_WORLD – predefined communicator that includes all processes
• Rank – unique ID
Environment Management Routines
• MPI_Init (&argc, &argv)
• MPI_Comm_size (comm, &size)
• MPI_Comm_rank (comm, &rank)
• MPI_Abort (comm, errorcode)
• MPI_Get_processor_name (&name, &resultlength)
Environment Management Routines (contd.)
• MPI_Get_version (&version, &subversion)
• MPI_Initialized (&flag)
• MPI_Wtime ()
• MPI_Wtick ()
• MPI_Finalize ()
• Fortran – extra parameter ierr in all functions except the time functions
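A short sketch in C exercising the routines listed above (the timed "work" is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int version, subversion;
    MPI_Get_version(&version, &subversion);   /* MPI standard version in use */

    char name[MPI_MAX_PROCESSOR_NAME];
    int namelen;
    MPI_Get_processor_name(name, &namelen);   /* host this rank is running on */

    double t0 = MPI_Wtime();                  /* wall-clock time in seconds */
    /* ... work being timed would go here ... */
    double elapsed = MPI_Wtime() - t0;

    printf("MPI %d.%d on %s, timer resolution %g s, elapsed %g s\n",
           version, subversion, name, MPI_Wtick(), elapsed);

    MPI_Finalize();
    return 0;
}
```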
Vector Addition Example
MPI Sending Data
MPI Receiving Data
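The code shown on these slides does not survive in the transcript; the following is one possible reconstruction in C, assuming the vector length N is divisible by the number of processes and that rank 0 owns the full vectors (the names a, b, c and the message tags are illustrative):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* total vector length; assumed divisible by the process count */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    float *a, *b, *c;

    if (rank == 0) {
        /* Rank 0 owns the full vectors and sends one chunk to each worker */
        a = malloc(N * sizeof(float));
        b = malloc(N * sizeof(float));
        c = malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }
        for (int p = 1; p < size; p++) {
            MPI_Send(a + p * chunk, chunk, MPI_FLOAT, p, 0, MPI_COMM_WORLD);
            MPI_Send(b + p * chunk, chunk, MPI_FLOAT, p, 1, MPI_COMM_WORLD);
        }
    } else {
        /* Workers hold only their own chunk */
        a = malloc(chunk * sizeof(float));
        b = malloc(chunk * sizeof(float));
        c = malloc(chunk * sizeof(float));
        MPI_Recv(a, chunk, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b, chunk, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Each process adds its own chunk */
    for (int i = 0; i < chunk; i++)
        c[i] = a[i] + b[i];

    /* Collect the partial results back on rank 0 */
    if (rank == 0) {
        for (int p = 1; p < size; p++)
            MPI_Recv(c + p * chunk, chunk, MPI_FLOAT, p, 2, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        printf("c[N-1] = %f\n", c[N - 1]);   /* expect 3 * (N - 1) */
    } else {
        MPI_Send(c, chunk, MPI_FLOAT, 0, 2, MPI_COMM_WORLD);
    }

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}
```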
MPI Barriers
• int MPI_Barrier (comm)
  • comm – communicator
• Very similar to barrier synchronization in CUDA: __syncthreads()
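A short sketch of MPI_Barrier usage (the printed messages are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Rank %d: before the barrier\n", rank);

    /* No process continues past this point until every process in
       MPI_COMM_WORLD has reached it -- analogous to CUDA's
       __syncthreads() within a thread block. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Rank %d: after the barrier\n", rank);

    MPI_Finalize();
    return 0;
}
```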
Vector Addition Example (contd.)
Point-to-Point Operations
• Typically involve two, and only two, different MPI processes
• Different types of send and receive routines
  • Synchronous send
  • Blocking send / blocking receive
  • Non-blocking send / non-blocking receive
  • Buffered send
  • Combined send/receive
  • "Ready" send
• Send/receive routines are not tightly coupled – any type of send can be paired with any type of receive
Buffering
• Why is buffering required? Send and receive operations are rarely perfectly synchronized, so the library must hold data somewhere in transit
• Implementation dependent
• Opaque to the programmer and managed by the MPI library
• Advantages
  • Can exist on the sending side, the receiving side, or both
  • Improves program performance
• Disadvantages
  • A finite resource that can be easy to exhaust
  • Often mysterious and not well documented
Blocking vs. Non-blocking

Blocking
• Send will only return after it’s safe to modify the application buffer
• Receive returns after the data has arrived and is ready for use by the application
• Synchronous communication is possible
• Asynchronous communication is also possible

Non-blocking
• Send/receive return almost immediately
• Unsafe to modify our variables till we know the operation has completed
• Only asynchronous communication is possible
• Primarily used to overlap computation with communication for a performance gain
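A sketch of non-blocking overlap using MPI_Isend/MPI_Irecv (assumes exactly 2 processes; the partner arithmetic is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

/* Two ranks exchange one integer with non-blocking calls,
   overlapping "computation" with the communication. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int send_val = rank, recv_val = -1;
    int partner = 1 - rank;            /* 0 <-> 1; assumes 2 processes */
    MPI_Request reqs[2];

    /* Both calls return almost immediately ... */
    MPI_Isend(&send_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... so independent computation can proceed here, overlapped with
       the communication. send_val and recv_val must NOT be touched
       until the requests complete. */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* now the buffers are safe */

    printf("Rank %d received %d\n", rank, recv_val);

    MPI_Finalize();
    return 0;
}
```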
Order and Fairness
• Order
  • MPI guarantees that messages will not overtake each other
  • Order rules do not apply if there are multiple threads participating in the communication operations
• Fairness
  • MPI does not guarantee fairness – it's up to the programmer to prevent "operation starvation"
Types of Collective Communication Routines
• Synchronization (e.g., MPI_Barrier)
• Data movement (e.g., MPI_Bcast, MPI_Scatter, MPI_Gather)
• Collective computation / reductions (e.g., MPI_Reduce)
Collective Communication Routines (contd.)
• Scope
  • Must involve all processes within the scope of a communicator
  • Unexpected behavior, including program failure, can occur if even one task in the communicator doesn't participate
  • Programmer's responsibility to ensure that all processes within a communicator participate in any collective operations
• Collective communication functions are highly optimized
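A small sketch of two collective calls; note that every rank in the communicator makes both calls (the value 42 is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Data movement: rank 0 broadcasts a value to every process.
       All ranks in the communicator must participate. */
    int n = (rank == 0) ? 42 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Collective computation: sum one contribution per rank onto rank 0 */
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, sum of ranks 0..%d = %d\n",
               n, size - 1, total);

    MPI_Finalize();
    return 0;
}
```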
Groups & Communicators (additional details)
• Group
  • Represented within system memory as an object
  • Only accessible as a handle
  • Always associated with a communicator object
• Communicator
  • Represented within system memory as an object
  • In the simplest sense, an extra "tag" that must be included with MPI calls
  • Inter-group and intra-group communicators available
• From the programmer's perspective, a group and a communicator are one
Primary Purposes of Group and Communicator Objects
1. Allow you to organize tasks, based upon function, into task groups
2. Enable collective communications operations across a subset of related tasks
3. Provide basis for implementing user-defined virtual topologies
4. Provide for safe communications
Programming Considerations and Restrictions
• Groups/communicators are dynamic
• Processes may be in more than one group/communicator
• MPI provides over 40 routines related to groups, communicators, and virtual topologies
• Typical usage:
  1. Extract the handle of the global group from MPI_COMM_WORLD using MPI_Comm_group
  2. Form a new group as a subset of the global group using MPI_Group_incl
  3. Create a new communicator for the new group using MPI_Comm_create
  4. Determine the new rank in the new communicator using MPI_Comm_rank
  5. Conduct communications using any MPI message passing routine
  6. When finished, free up the new communicator and group (optional) using MPI_Comm_free and MPI_Group_free
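The typical-usage steps can be sketched as follows (a hypothetical split putting the first half of the ranks into their own communicator; assumes at least 2 processes):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* 1. Extract the handle of the global group */
    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* 2. Form a new group: ranks 0 .. world_size/2 - 1 */
    int n = world_size / 2;
    int *ranks = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) ranks[i] = i;
    MPI_Group new_group;
    MPI_Group_incl(world_group, n, ranks, &new_group);

    /* 3. Create a communicator for the new group (collective over
          MPI_COMM_WORLD; ranks outside the group get MPI_COMM_NULL) */
    MPI_Comm new_comm;
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);

    if (new_comm != MPI_COMM_NULL) {
        /* 4. Determine the new rank within the new communicator */
        int new_rank;
        MPI_Comm_rank(new_comm, &new_rank);
        printf("World rank %d is rank %d in the sub-communicator\n",
               world_rank, new_rank);
        /* 5. ...conduct communications over new_comm here... */
        /* 6. Free the new communicator */
        MPI_Comm_free(&new_comm);
    }
    MPI_Group_free(&new_group);
    MPI_Group_free(&world_group);

    free(ranks);
    MPI_Finalize();
    return 0;
}
```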
Virtual Topologies
• Mapping/ordering of MPI processes into a geometric "shape"
• Similar to the CUDA grid/block 2D/3D structure
• They are only virtual
• Two main types
  • Cartesian (grid)
  • Graph
• Virtual topologies are built upon MPI communicators and groups
• Must be "programmed" by the application developer
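A sketch of a Cartesian (grid) topology; the 2D layout and the reorder flag are illustrative choices:

```c
#include <mpi.h>
#include <stdio.h>

/* Arrange the processes in a 2D Cartesian grid, e.g. 4 processes
   become a 2 x 2 grid. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);    /* pick a balanced 2D factorization */

    int periods[2] = {0, 0};           /* non-periodic in both dimensions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* allow rank reordering */, &cart);

    /* Ranks may be reordered in the new communicator, so query it */
    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);
    printf("Rank %d at grid position (%d, %d) in a %d x %d grid\n",
           cart_rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```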
Why use Virtual Topologies?
• Convenience
  • Useful for applications with specific communication patterns
• Communication efficiency
  • Avoids the penalty some hardware architectures impose on communication between distant nodes
  • Process mapping may be optimized based on the physical characteristics of the machine
  • The MPI implementation decides whether the virtual topology is ignored or not
Pheew!… All done!
Thank You!
ANY QUESTIONS?