
MPICH2 – A High-Performance and Widely Portable Open-Source MPI Implementation

Darius Buntinas

Argonne National Laboratory

Overview

MPICH2

– High-performance

– Open-source

– Widely portable

MPICH2-based implementations

– IBM for BG/L and BG/P

– Cray for XT3/4

– Intel

– Microsoft

– SiCortex

– Myricom

– Ohio State

Outline

Architectural overview

Nemesis – a new communication subsystem

New features and optimizations
– Intranode communication
– Optimizing non-contiguous messages
– Optimizing large messages

Current work in progress
– Optimizations
– Multi-threaded environments
– Process manager
– Other optimizations

Libraries and tools

Traditional MPICH2 Developer APIs

Two APIs for porting MPICH2 to new communication architectures
– ADI3
– CH3

ADI3 – Implement a new device
– Richer interface: ~60 functions
– More work to port
– More flexibility

CH3 – Implement a new CH3 channel
– Simpler interface: ~15 functions
– Easier to port
– Less flexibility

[Figure: MPICH2 software architecture – the application sits on the MPI layer (with MPE and PMPI interfaces); the ADI3 interface leads to the CH3 device and vendor devices (BG, Cray, MX); the CH3 interface leads to channels (Nemesis, Sock, SSHM, SHM, SCTP, ...); the Nemesis netmod interface leads to network modules (TCP, IB/iWARP, PSM, MX, GM); ROMIO sits on the ADIO interface over file systems (PVFS, GPFS, XFS, ...); process managers (MPD, SMPD, Gforker) connect through the PMI interface; MPE/Jumpshot provide tracing and visualization]

Support for High-speed Networks

– 10-Gigabit Ethernet iWARP, QLogic PSM, InfiniBand, Myrinet (MX and GM)

Supports proprietary platforms

– BlueGene/L, BlueGene/P, SiCortex, Cray

Distribution with ROMIO MPI/IO library

Profiling and visualization tools (MPE, Jumpshot)


Nemesis

Nemesis is a new CH3 channel for MPICH2

– Shared memory for intranode communication
  • Lock-free queues
  • Scalability
  • Improved intranode performance

– Network modules for internode communication
  • New interface

New developer API – Nemesis netmod interface

– Simpler interface than ADI3

– More flexible than CH3

Nemesis: Lock-Free Queues

Atomic memory operations

Scalable
– One recv queue per process

Optimized to reduce cache misses

[Figure: per-process Recv and Free queues in shared memory]
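To make the lock-free idea concrete, here is a minimal sketch of a compare-and-swap enqueue in C. The cell and queue types and the function name are hypothetical, not the actual Nemesis data structures, and a real implementation also needs memory barriers:

#include <stddef.h>

/* Hypothetical cell/queue layout, for illustration only -- not the actual
 * Nemesis structures. Many senders enqueue; one receiver dequeues. */
typedef struct cell {
    struct cell *next;
    char payload[64];
} cell_t;

typedef struct {
    cell_t *head;   /* read by the single receiver   */
    cell_t *tail;   /* updated atomically by senders */
} queue_t;

/* Lock-free enqueue: senders contend only on the tail pointer, using a
 * compare-and-swap loop (GCC __sync builtins) instead of a lock. */
static void enqueue(queue_t *q, cell_t *c)
{
    cell_t *prev;
    c->next = NULL;
    do {
        prev = q->tail;
    } while (__sync_val_compare_and_swap(&q->tail, prev, c) != prev);

    if (prev != NULL)
        prev->next = c;   /* link the old tail to the new cell   */
    else
        q->head = c;      /* queue was empty: publish the head   */

    /* A production version needs memory barriers, and the receiver must
     * tolerate the brief window before the head/next link is visible. */
}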

Nemesis Network Modules

Improved interface for network modules

– Allows optimized handling of noncontiguous data

– Allows optimized transfer of large data

– Optimized small contiguous message path
  • < 2.5 µs over QLogic PSM

Future work

– Multiple network modules
  • E.g., Myrinet for intra-cluster and TCP for inter-cluster communication

– Dynamically loadable

[Figure: small-message latency (µsec) vs. message size (0–512 bytes) for the QLogic PSM and Gigabit Ethernet network modules]

Optimized Non-contiguous Messages

Issues with non-contiguous data

– Representation

– Manipulation
  • Packing, generating other representations (e.g., an I/O vector), etc.

Dataloops – MPICH2’s optimized internal datatype representation

– Efficiently describes non-contiguous data

– Utilities to efficiently manipulate non-contiguous data

The dataloop is passed to the network module

– Previously, an I/O vector was generated then passed

– Netmod implementation manipulates the dataloop, e.g.:
  • TCP generates an iov
  • IB and PSM pack data into a send buffer
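For context, this is the kind of non-contiguous transfer the optimization targets: sending one column of a row-major matrix using a standard MPI_Type_vector datatype (a small, self-contained example, not MPICH2-internal code):

#include <mpi.h>

/* Send one column of an N x N row-major matrix of doubles: N blocks of
 * 1 element, each N elements apart -- a non-contiguous access pattern
 * that MPICH2 describes internally with a dataloop. */
#define N 512

void send_column(double matrix[N][N], int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&matrix[0][col], 1, column, dest, 0, comm);
    MPI_Type_free(&column);
}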

Optimized Large Message Transfer Using Rendezvous

MPICH2 uses rendezvous to transfer large messages

– Original implementation: the channel was oblivious to rendezvous
  • CH3 sent RTS, CTS, DATA
  • Shared memory: large messages were sent through the queue
  • Netmod: the netmod performed its own rendezvous

– Shared memory: queues may not be the most efficient mechanism for large data
  • E.g., network RDMA, an inter-process copy mechanism, or a copy buffer

– Netmod: redundant rendezvous

Developed the LMT (large message transfer) interface to support various mechanisms

– Sender transfers data (put)

– Receiver transfers data (get)

– Both sender and receiver participate in data transfer

Modified CH3 to use LMT

– Works with rendezvous protocol

Optimization: LMT for Intranode Communication

For intranode communication, LMT copies through a buffer in shared memory

Sender allocates a shared memory region
– Sends the buffer ID to the receiver in the RTS packet

Receiver attaches to the memory region

Both sender and receiver participate in the transfer
– Use double buffering (see the sketch below)

[Figure: sender and receiver copying through a double-buffered shared-memory region]
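A minimal sketch of the double-buffering copy on the sender side, with made-up buffer size, flag names, and function name (illustrative only, not MPICH2's LMT code); the receiver mirrors this loop, draining one buffer while the sender fills the other:

#include <string.h>

#define LMT_BUF_SIZE (64 * 1024)   /* assumed copy-buffer size */

/* One half of a double-buffered copy region in shared memory. */
typedef struct {
    volatile int full;             /* 1: sender filled it, 0: receiver drained it */
    volatile size_t len;           /* number of valid bytes (read by the receiver) */
    char data[LMT_BUF_SIZE];
} lmt_buf_t;

/* Sender side: alternate between the two buffers, waiting only when the
 * receiver has not yet drained the one we want to refill. Real code would
 * add memory barriers and avoid pure busy-waiting. */
void lmt_send(lmt_buf_t buf[2], const char *src, size_t total)
{
    size_t off = 0;
    int i = 0;
    while (off < total) {
        size_t n = total - off < LMT_BUF_SIZE ? total - off : LMT_BUF_SIZE;
        while (buf[i].full)        /* wait for the receiver to free this buffer */
            ;
        memcpy(buf[i].data, src + off, n);
        buf[i].len = n;
        buf[i].full = 1;           /* hand it to the receiver */
        off += n;
        i ^= 1;                    /* switch to the other buffer */
    }
}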

Current Work In Progress

Optimizations

Multi-threaded environments

Process manager

Other work

Atomic operations library

Current Optimization Work

Handle common case fast: Eager contiguous messages

– Identify this case early in the operation

– Call the netmod’s send_eager_contig() function directly (see the sketch below)

Bypass the receive queue
– Currently: check the unexpected queue, post on the posted queue, check the network
– Optimized: check the unexpected queue, check the network
  • Reduced instruction count by 48%

Eliminate function calls

– Collapse layers where possible

Merge Nemesis with CH3

– Move Nemesis functionality to CH3

– CH3 shared memory support

– New CH3 channel/netmod interface

Cache-aware placement of fields in structures
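The fast-path dispatch might look roughly like the following sketch; the type names and the send_eager_contig signature are assumptions for illustration, not the actual MPICH2 internals:

#include <stddef.h>

/* Hypothetical types standing in for MPICH2 internals (illustration only). */
typedef struct netmod_ops {
    int (*send_eager_contig)(void *vc, const void *buf, size_t len, int tag);
} netmod_ops_t;

typedef struct vc {
    netmod_ops_t *netmod;      /* network module bound to this connection */
    size_t eager_threshold;    /* largest size sent eagerly */
} vc_t;

/* Slow path: the general request-based send machinery (stubbed out here). */
static int generic_send(vc_t *vc, const void *buf, size_t len,
                        int dt_contig, int tag)
{
    (void)vc; (void)buf; (void)len; (void)dt_contig; (void)tag;
    return 0;
}

/* Fast path: identify the common case (small, contiguous) early and call the
 * netmod's eager-contiguous routine directly, bypassing the generic layers. */
int send_fastpath(vc_t *vc, const void *buf, size_t len, int dt_contig, int tag)
{
    if (dt_contig && len <= vc->eager_threshold)
        return vc->netmod->send_eager_contig(vc, buf, len, tag);
    return generic_send(vc, buf, len, dt_contig, tag);
}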

Fine-Grained Threading

MPICH2 supports multi-threaded applications

– MPI_THREAD_MULTIPLE (see the example below)

Currently, thread safety is implemented with a single global lock

– Lock is acquired on entering an MPI function

– And released on exit

– Also released when making blocking communication system calls

Limits concurrency in communication
– Only one thread can be in the progress engine at one time

New architectures have multiple DMA engines for communication
– These can work independently of each other

Concurrency is needed in the progress engine for maximum performance

Even without independent network hardware

– Internal concurrency can improve performance
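For reference, an application requests full thread support like this (standard MPI API); the fine-grained-locking work is about letting such threads make progress concurrently inside the library instead of serializing on one lock:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full thread support; MPICH2 grants MPI_THREAD_MULTIPLE when
     * built with thread support enabled. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("Warning: MPI_THREAD_MULTIPLE not available (got %d)\n", provided);

    /* ... threads created here (e.g., with pthreads) may each make MPI calls;
     * with a single global lock they serialize inside the library, which is
     * what the fine-grained work aims to avoid ... */

    MPI_Finalize();
    return 0;
}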

Multicore-Aware Collectives

Intra-node communication is much faster than inter-node communication

Take advantage of this in collective algorithms

E.g., broadcast (see the sketch below)

– Send to one process per node, that process broadcasts to other processes on that node

Step further: collectives over shared memory

– E.g., broadcast
  • Within a node, one process writes the data to a shared memory region
  • Other processes read the data

– Issues
  • Memory traffic, cache misses, etc.
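A sketch of the two-level broadcast using only standard MPI calls, not MPICH2's internal collective code. The node_color() helper is an assumption (it groups ranks by hashing the host name, so in principle two hosts could collide), and the root is taken to be global rank 0 for simplicity:

#include <mpi.h>

/* Assumed helper: a color that is identical for processes on the same node. */
static int node_color(void)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    unsigned h = 5381;
    MPI_Get_processor_name(name, &len);
    for (int i = 0; i < len; i++)
        h = h * 33 + (unsigned char)name[i];
    return (int)(h & 0x7fffffff);
}

/* Two-level broadcast from global rank 0: rank 0 -> one leader per node ->
 * the other processes on each node. */
void bcast_hier(void *buf, int count, MPI_Datatype dt, MPI_Comm comm)
{
    int rank, node_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(comm, &rank);

    /* Group processes by node; rank 0 of each node_comm is that node's leader. */
    MPI_Comm_split(comm, node_color(), rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Leaders form the inter-node communicator; everyone else gets MPI_COMM_NULL. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, dt, 0, leader_comm);   /* across nodes   */
    MPI_Bcast(buf, count, dt, 0, node_comm);         /* within a node  */

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}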

Process Manager

Enhanced support for third-party process managers
– PBS, Slurm
– Working on others

Replacement for existing process managers
– Scalable to 10,000s of nodes and beyond
– Fault-tolerant
– Aware of topology

Other Work

Heterogeneous data representations

– Different architectures use different data representations
  • E.g., big/little-endian, 32/64-bit, IEEE/non-IEEE floats, etc.

– Important for heterogeneous clusters and grids

– Use existing datatype manipulation utilities

Fault-tolerance support

– CIFTS – fault-tolerance backplane

– Fault detection and reporting

Atomic Operations Library

Lock-free algorithms use atomic assembly instructions

Assembly instructions are non-portable

– Must be ported for each architecture and compiler

We’re working on an atomic operations library

– Implementations for various architectures and various compilers

– Stand-alone library

– Not all atomic operations are natively supported on all architectures
  • E.g., some have LL/SC but no SWAP

– Such operations can be emulated using provided operations
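As an illustration of that emulation (not the library's actual code), SWAP and fetch-and-add can be built from compare-and-swap, written here with GCC's __sync builtins:

/* Emulate atomic SWAP on top of compare-and-swap: loop until we observe a
 * value and successfully replace it with 'newval'. */
static inline int atomic_swap_int(volatile int *ptr, int newval)
{
    int old;
    do {
        old = *ptr;
    } while (!__sync_bool_compare_and_swap(ptr, old, newval));
    return old;
}

/* Fetch-and-add can be emulated the same way on architectures lacking it. */
static inline int atomic_fetch_add_int(volatile int *ptr, int inc)
{
    int old;
    do {
        old = *ptr;
    } while (!__sync_bool_compare_and_swap(ptr, old, old + inc));
    return old;
}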

Tools Included in MPICH2

MPE library for tracing MPI and other calls

Scalable log file format (slog2)

Jumpshot tool for visualizing log files
– Supports threads

Collchk library for checking that the application calls collective operations correctly
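A small example of user-defined logging with the classic MPE logging API; the state name, color, and log file name are arbitrary, and the header/link details depend on the installation. The resulting log can be viewed with Jumpshot:

#include <mpi.h>
#include <mpe.h>

int main(int argc, char **argv)
{
    int ev_start, ev_end;

    MPI_Init(&argc, &argv);
    MPE_Init_log();

    /* Define a "Compute" state bracketed by two user events. */
    ev_start = MPE_Log_get_event_number();
    ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_start, ev_end, "Compute", "red");

    MPE_Log_event(ev_start, 0, "begin compute");
    /* ... application work ... */
    MPE_Log_event(ev_end, 0, "end compute");

    MPE_Finish_log("app_trace");   /* writes a log file viewable in Jumpshot */
    MPI_Finalize();
    return 0;
}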

For more information…

MPICH2 website

– http://www.mcs.anl.gov/research/projects/mpich2

SVN repository

– svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk mpich2

Developer pages

– http://wiki.mcs.anl.gov/mpich2/index.php/Developer_Documentation

Mailing lists

[email protected]

[email protected]

Me

[email protected]

– http://www.mcs.anl.gov/~buntinas