Programming a Heterogeneous Computing Cluster
PRESENTED BY AASHRITH H. GOVINDRAJ
We’ll discuss the following today
• Background of Heterogeneous Computing
• Message Passing Interface (MPI)
• Vector Addition Example (MPI Implementation)
• More implementation details of MPI
Background
• Heterogeneous Computing System (HCS)
• High Performance Computing & its uses
• Supercomputer vs. HCS
• Why use Heterogeneous Computers in HCS?
• MPI is the predominant message passing system for clusters
Introduction to MPI
• MPI stands for Message Passing Interface
• Predominant API
• Runs on virtually any hardware platform
• Programming model – distributed memory model
• Supports explicit parallelism
• Multiple languages supported
Reasons for using MPI
• Standardization
• Portability
• Performance opportunities
• Functionality
• Availability
MPI Model
• Flat view of the cluster to the programmer
• SPMD programming model
• No global memory
• Inter-process communication is possible & required
• Process synchronization primitives
MPI Program Structure
• Required header file
  • C – mpi.h
  • Fortran – mpif.h
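A minimal C skeleton with this structure might look like the following (a sketch; the work between the initialize and finalize calls is a placeholder):

```c
#include <mpi.h>    /* the required MPI header for C */
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);   /* initialize the MPI environment; must precede other MPI calls */

    /* ... the program's serial and parallel work goes here ... */
    printf("Hello from an MPI process\n");

    MPI_Finalize();           /* terminate the MPI environment; no MPI calls after this */
    return 0;
}
```

Compile with an MPI compiler wrapper (e.g. `mpicc`) and launch with `mpirun -np 4 ./a.out`, assuming an MPI implementation such as Open MPI or MPICH is installed.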
MPI Thread Support
• Level 0 – MPI_THREAD_SINGLE: only one thread in the process
• Level 1 – MPI_THREAD_FUNNELED: multithreaded, but only the main thread makes MPI calls
• Level 2 – MPI_THREAD_SERIALIZED: multithreaded, but only one thread makes MPI calls at a time
• Level 3 – MPI_THREAD_MULTIPLE: any thread may make MPI calls at any time
Format of MPI Calls
• Case sensitivity
  • C – yes
  • Fortran – no
• Name restrictions
  • MPI_* and PMPI_* (profiling interface) are reserved
• Error handling
  • Handled via the return parameter
Groups & Communicators
• Groups – ordered set of processes
• Communicators – handle to a group of processes
• Most MPI routines require a communicator as an argument
• MPI_COMM_WORLD – predefined communicator that includes all processes
• Rank – unique ID
Environment Management Routines
• MPI_Init (&argc, &argv)
• MPI_Comm_size (comm, &size)
• MPI_Comm_rank (comm, &rank)
• MPI_Abort (comm, errorcode)
• MPI_Get_processor_name (&name, &resultlength)
Environment Management Routines (contd.)
• MPI_Get_version (&version, &subversion)
• MPI_Initialized (&flag)
• MPI_Wtime ()
• MPI_Wtick ()
• MPI_Finalize ()
• Fortran – extra parameter ierr in all functions except the time functions
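A short sketch in C exercising the routines listed above (the timed "work" is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int version, subversion;
    MPI_Get_version(&version, &subversion);   /* MPI standard version in use */

    char name[MPI_MAX_PROCESSOR_NAME];
    int namelen;
    MPI_Get_processor_name(name, &namelen);   /* host this rank is running on */

    double t0 = MPI_Wtime();                  /* wall-clock time in seconds */
    /* ... work being timed would go here ... */
    double elapsed = MPI_Wtime() - t0;

    printf("MPI %d.%d on %s, timer resolution %g s, elapsed %g s\n",
           version, subversion, name, MPI_Wtick(), elapsed);

    MPI_Finalize();
    return 0;
}
```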
Vector Addition Example
MPI Sending Data
MPI Receiving Data
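The code shown on these slides does not survive in the transcript; the following is one possible reconstruction in C, assuming the vector length N is divisible by the number of processes and that rank 0 owns the full vectors (the names a, b, c and the message tags are illustrative):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* total vector length; assumed divisible by the process count */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    float *a, *b, *c;

    if (rank == 0) {
        /* Rank 0 owns the full vectors and sends one chunk to each worker */
        a = malloc(N * sizeof(float));
        b = malloc(N * sizeof(float));
        c = malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }
        for (int p = 1; p < size; p++) {
            MPI_Send(a + p * chunk, chunk, MPI_FLOAT, p, 0, MPI_COMM_WORLD);
            MPI_Send(b + p * chunk, chunk, MPI_FLOAT, p, 1, MPI_COMM_WORLD);
        }
    } else {
        /* Workers hold only their own chunk */
        a = malloc(chunk * sizeof(float));
        b = malloc(chunk * sizeof(float));
        c = malloc(chunk * sizeof(float));
        MPI_Recv(a, chunk, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b, chunk, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Each process adds its own chunk */
    for (int i = 0; i < chunk; i++)
        c[i] = a[i] + b[i];

    /* Collect the partial results back on rank 0 */
    if (rank == 0) {
        for (int p = 1; p < size; p++)
            MPI_Recv(c + p * chunk, chunk, MPI_FLOAT, p, 2, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        printf("c[N-1] = %f\n", c[N - 1]);   /* expect 3 * (N - 1) */
    } else {
        MPI_Send(c, chunk, MPI_FLOAT, 0, 2, MPI_COMM_WORLD);
    }

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}
```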
MPI Barriers
• int MPI_Barrier (comm)
  • comm – communicator
• Very similar to barrier synchronization in CUDA: __syncthreads()
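A short sketch of MPI_Barrier usage (the printed messages are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Rank %d: before the barrier\n", rank);

    /* No process continues past this point until every process in
       MPI_COMM_WORLD has reached it -- analogous to CUDA's
       __syncthreads() within a thread block. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Rank %d: after the barrier\n", rank);

    MPI_Finalize();
    return 0;
}
```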
Vector Addition Example (contd.)
Point-to-Point Operations
• Typically involve two, and only two, different MPI processes
• Different types of send and receive routines
  • Synchronous send
  • Blocking send / blocking receive
  • Non-blocking send / non-blocking receive
  • Buffered send
  • Combined send/receive
  • "Ready" send
• Send/receive routines are not tightly coupled – any type of send can be paired with any type of receive
Buffering
• Why is buffering required? Send and receive operations are rarely perfectly synchronized, so the library must hold data somewhere in transit
• Implementation dependent
• Opaque to the programmer and managed by the MPI library
• Advantages
  • Can exist on the sending side, the receiving side, or both
  • Improves program performance
• Disadvantages
  • A finite resource that can be easy to exhaust
  • Often mysterious and not well documented
Blocking vs. Non-blocking

Blocking
• Send will only return after it’s safe to modify the application buffer
• Receive returns after the data has arrived and is ready for use by the application
• Synchronous communication is possible
• Asynchronous communication is also possible

Non-blocking
• Send/receive return almost immediately
• Unsafe to modify our variables till we know the operation has completed
• Only asynchronous communication is possible
• Primarily used to overlap computation with communication for a performance gain
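A sketch of non-blocking overlap using MPI_Isend/MPI_Irecv (assumes exactly 2 processes; the partner arithmetic is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

/* Two ranks exchange one integer with non-blocking calls,
   overlapping "computation" with the communication. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int send_val = rank, recv_val = -1;
    int partner = 1 - rank;            /* 0 <-> 1; assumes 2 processes */
    MPI_Request reqs[2];

    /* Both calls return almost immediately ... */
    MPI_Isend(&send_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... so independent computation can proceed here, overlapped with
       the communication. send_val and recv_val must NOT be touched
       until the requests complete. */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* now the buffers are safe */

    printf("Rank %d received %d\n", rank, recv_val);

    MPI_Finalize();
    return 0;
}
```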
Order and Fairness
• Order
  • MPI guarantees that messages will not overtake each other
  • Order rules do not apply if there are multiple threads participating in the communication operations
• Fairness
  • MPI does not guarantee fairness – it's up to the programmer to prevent "operation starvation"
Types of Collective Communication Routines
• Synchronization (e.g., MPI_Barrier)
• Data movement (e.g., MPI_Bcast, MPI_Scatter, MPI_Gather)
• Collective computation / reductions (e.g., MPI_Reduce)
Collective Communication Routines (contd.)
• Scope
  • Must involve all processes within the scope of a communicator
  • Unexpected behavior, including program failure, can occur if even one task in the communicator doesn't participate
  • Programmer's responsibility to ensure that all processes within a communicator participate in any collective operations
• Collective communication functions are highly optimized
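A small sketch of two collective calls; note that every rank in the communicator makes both calls (the value 42 is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Data movement: rank 0 broadcasts a value to every process.
       All ranks in the communicator must participate. */
    int n = (rank == 0) ? 42 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Collective computation: sum one contribution per rank onto rank 0 */
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, sum of ranks 0..%d = %d\n",
               n, size - 1, total);

    MPI_Finalize();
    return 0;
}
```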
Groups & Communicators (additional details)
• Group
  • Represented within system memory as an object
  • Only accessible as a handle
  • Always associated with a communicator object
• Communicator
  • Represented within system memory as an object
  • In the simplest sense, an extra "tag" that must be included with MPI calls
  • Inter-group and intra-group communicators available
• From the programmer's perspective, a group and a communicator are one
Primary Purposes of Group and Communicator Objects
1. Allow you to organize tasks, based upon function, into task groups
2. Enable collective communications operations across a subset of related tasks
3. Provide basis for implementing user-defined virtual topologies
4. Provide for safe communications
Programming Considerations and Restrictions
• Groups/communicators are dynamic
• Processes may be in more than one group/communicator
• MPI provides over 40 routines related to groups, communicators, and virtual topologies
• Typical usage:
  1. Extract the handle of the global group from MPI_COMM_WORLD using MPI_Comm_group
  2. Form a new group as a subset of the global group using MPI_Group_incl
  3. Create a new communicator for the new group using MPI_Comm_create
  4. Determine the new rank in the new communicator using MPI_Comm_rank
  5. Conduct communications using any MPI message passing routine
  6. When finished, free up the new communicator and group (optional) using MPI_Comm_free and MPI_Group_free
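The typical-usage steps can be sketched as follows (a hypothetical split putting the first half of the ranks into their own communicator; assumes at least 2 processes):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* 1. Extract the handle of the global group */
    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* 2. Form a new group: ranks 0 .. world_size/2 - 1 */
    int n = world_size / 2;
    int *ranks = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) ranks[i] = i;
    MPI_Group new_group;
    MPI_Group_incl(world_group, n, ranks, &new_group);

    /* 3. Create a communicator for the new group (collective over
          MPI_COMM_WORLD; ranks outside the group get MPI_COMM_NULL) */
    MPI_Comm new_comm;
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);

    if (new_comm != MPI_COMM_NULL) {
        /* 4. Determine the new rank within the new communicator */
        int new_rank;
        MPI_Comm_rank(new_comm, &new_rank);
        printf("World rank %d is rank %d in the sub-communicator\n",
               world_rank, new_rank);
        /* 5. ...conduct communications over new_comm here... */
        /* 6. Free the new communicator */
        MPI_Comm_free(&new_comm);
    }
    MPI_Group_free(&new_group);
    MPI_Group_free(&world_group);

    free(ranks);
    MPI_Finalize();
    return 0;
}
```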
Virtual Topologies
• Mapping/ordering of MPI processes into a geometric "shape"
• Similar to the CUDA grid/block 2D/3D structure
• They are only virtual
• Two main types
  • Cartesian (grid)
  • Graph
• Virtual topologies are built upon MPI communicators and groups
• Must be "programmed" by the application developer
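A sketch of a Cartesian (grid) topology; the 2D layout and the reorder flag are illustrative choices:

```c
#include <mpi.h>
#include <stdio.h>

/* Arrange the processes in a 2D Cartesian grid, e.g. 4 processes
   become a 2 x 2 grid. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);    /* pick a balanced 2D factorization */

    int periods[2] = {0, 0};           /* non-periodic in both dimensions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* allow rank reordering */, &cart);

    /* Ranks may be reordered in the new communicator, so query it */
    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);
    printf("Rank %d at grid position (%d, %d) in a %d x %d grid\n",
           cart_rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```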
Why use Virtual Topologies?
• Convenience
  • Useful for applications with specific communication patterns
• Communication efficiency
  • Avoids the penalty some hardware architectures impose on communication between distant nodes
  • Process mapping may be optimized based on the physical characteristics of the machine
  • The MPI implementation decides whether the virtual topology is ignored or not
Pheew!… All done!
Thank You!
ANY QUESTIONS?