TRANSCRIPT
MPICH2 – A High-Performance and Widely Portable Open-Source MPI Implementation
Darius Buntinas
Argonne National Laboratory
Overview
MPICH2
– High-performance
– Open-source
– Widely portable
MPICH2-based implementations
– IBM for BG/L and BG/P
– Cray for XT3/4
– Intel
– Microsoft
– SiCortex
– Myricom
– Ohio State
Outline
Architectural overview
Nemesis – a new communication subsystem
New features and optimizations
– Intranode communication
– Optimizing non-contiguous messages
– Optimizing large messages
Current work in progress
– Optimizations
– Multi-threaded environments
– Process manager
– Other optimizations
Libraries and tools
Traditional MPICH2 Developer APIs
Two APIs for porting MPICH2 to new communication architectures
– ADI3
– CH3
ADI3 – Implement a new device
– Richer interface • ~60 functions
– More work to port
– More flexibility
CH3 – Implement a new CH3 channel
– Simpler interface • ~15 functions
– Easier to port
– Less flexibility
[Architecture diagram: Application over the MPI/MPE interfaces; MPI layer over the ADI3 interface; CH3 device with channels (Nemesis, Sock, SHM, SSHM, SCTP, ...) behind the CH3 interface; Nemesis network modules (TCP, IB/iWARP, PSM, MX, BG, Cray); ROMIO over file systems (PVFS, GPFS, XFS, ...)]
Support for High-speed Networks
– 10-Gigabit Ethernet iWARP, QLogic PSM, InfiniBand, Myrinet (MX and GM)
Supports proprietary platforms
– BlueGene/L, BlueGene/P, SiCortex, Cray
Distribution with ROMIO MPI/IO library
Profiling and visualization tools (MPE, Jumpshot)
[Architecture diagram, continued: ADIO interface under ROMIO; process managers (MPD, SMPD, Gforker) behind the PMI interface; Nemesis netmod interface with modules such as GM; MPE/Jumpshot tools; ADI3 interface]
Nemesis
Nemesis is a new CH3 channel for MPICH2
– Shared memory for intranode communication
• Lock-free queues
• Scalability
• Improved intranode performance
– Network modules for internode communication
• New interface
New developer API – Nemesis netmod interface
– Simpler interface than ADI3
– More flexible than CH3
Nemesis: Lock-Free Queues
Atomic memory operations
Scalable
– One recv queue per process
Optimized to reduce cache misses
[Diagram: per-process Recv and Free queues, showing the enqueue/dequeue steps]
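To make the idea concrete, here is a minimal sketch of a lock-free multiple-producer, single-consumer queue in the spirit of the per-process receive queues described above. The names and layout are illustrative, not MPICH2's actual internals; the atomics use the GCC/Clang `__sync` builtins.

```c
#include <assert.h>
#include <stddef.h>

typedef struct cell {
    struct cell *next;
    int payload;
} cell_t;

typedef struct {
    cell_t *head;   /* touched only by the single receiver          */
    cell_t *tail;   /* swapped atomically by any sender, no locks   */
} queue_t;

static void enqueue(queue_t *q, cell_t *c)
{
    c->next = NULL;
    /* Atomically publish ourselves as the new tail; the previous
     * tail (if any) then links to us. */
    cell_t *prev = __sync_lock_test_and_set(&q->tail, c);
    if (prev)
        prev->next = c;
    else
        q->head = c;            /* queue was empty */
}

static cell_t *dequeue(queue_t *q)
{
    cell_t *c = q->head;
    if (!c)
        return NULL;
    if (c->next) {
        q->head = c->next;
    } else {
        /* c looks like the last cell; try to swing tail back to NULL */
        q->head = NULL;
        if (!__sync_bool_compare_and_swap(&q->tail, c, NULL)) {
            while (!c->next)
                ;               /* an enqueuer is mid-publish; wait */
            q->head = c->next;
        }
    }
    return c;
}
```

Because only the tail pointer is ever contended, senders never block each other for long, and the single receiver needs no atomics at all on the common dequeue path.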
Nemesis Network Modules
Improved interface for network modules
– Allows optimized handling of noncontiguous data
– Allows optimized transfer of large data
– Optimized small contiguous message path
• < 2.5 µs over QLogic PSM
Future work
– Multiple network modules
• E.g., Myrinet for intra-cluster and TCP for inter-cluster
– Dynamically loadable
[Graph: latency (µsec) vs. message size (bytes), 0–512 bytes, for QLogic PSM and Gigabit Ethernet]
Optimized Non-contiguous Messages
Issues with non-contiguous data
– Representation
– Manipulation
• Packing, generating other representations (e.g., iov), etc.
Dataloops – MPICH2’s optimized internal datatype representation
– Efficiently describes non-contiguous data
– Utilities to efficiently manipulate non-contiguous data
Dataloop is passed to the network module
– Previously, an I/O vector was generated and then passed
– The netmod implementation manipulates the dataloop directly, e.g.:
• TCP uses an iov
• IB and PSM pack data into a send buffer
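The two strategies above can be sketched from a single strided-vector description, in the way a netmod might choose between them. This is an illustration, not MPICH2's dataloop code; the types and names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One non-contiguous description: count blocks of blocklen bytes,
 * stride bytes apart (a simplified stand-in for a dataloop). */
typedef struct { void *base; size_t count, blocklen, stride; } vec_t;
typedef struct { void *iov_base; size_t iov_len; } iov_t;

/* TCP-style: generate an I/O vector, no copying */
static size_t vec_to_iov(const vec_t *v, iov_t *iov, size_t max)
{
    size_t i;
    for (i = 0; i < v->count && i < max; i++) {
        iov[i].iov_base = (char *)v->base + i * v->stride;
        iov[i].iov_len  = v->blocklen;
    }
    return i;                       /* entries generated */
}

/* IB/PSM-style: pack the data into a contiguous send buffer */
static size_t vec_pack(const vec_t *v, char *buf)
{
    size_t i, off = 0;
    for (i = 0; i < v->count; i++, off += v->blocklen)
        memcpy(buf + off, (char *)v->base + i * v->stride, v->blocklen);
    return off;                     /* bytes packed */
}
```

Passing the description rather than a pre-generated iov lets each netmod pick whichever form its transport handles best.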
Optimized Large Message Transfer Using Rendezvous
MPICH2 uses rendezvous to transfer large messages
– Original implementation: the channel was oblivious to rendezvous
• CH3 sent RTS, CTS, DATA
• Shared memory: large messages were sent through the queue
• Netmod: the netmod performed its own rendezvous
– Shm: queues may not be the most efficient mechanism to transfer large data
• E.g., network RDMA, an inter-process copy mechanism, a copy buffer
– Netmod: redundant rendezvous
Developed the LMT (large message transfer) interface to support various mechanisms
– Sender transfers data (put)
– Receiver transfers data (get)
– Both sender and receiver participate in the data transfer
Modified CH3 to use LMT
– Works with the rendezvous protocol
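A hypothetical sketch of such an interface: CH3 drives the rendezvous handshake, and each mechanism plugs in its own put/get callbacks. The names and types here are illustrative, not the actual MPICH2 LMT API; the staging buffer stands in for shared memory, an RDMA region, or a copy buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct lmt_ctx {
    char staging[64];   /* stand-in for a shared buffer / RDMA region */
    size_t len;
} lmt_ctx_t;

typedef struct lmt_ops {
    /* sender-side transfer, used when the sender drives the copy */
    void (*put)(lmt_ctx_t *ctx, const void *src, size_t len);
    /* receiver-side transfer, used when the receiver drives the copy */
    void (*get)(lmt_ctx_t *ctx, void *dst, size_t len);
} lmt_ops_t;

/* One possible mechanism: copy through a staging buffer */
static void copy_put(lmt_ctx_t *ctx, const void *src, size_t len)
{
    memcpy(ctx->staging, src, len);
    ctx->len = len;
}

static void copy_get(lmt_ctx_t *ctx, void *dst, size_t len)
{
    memcpy(dst, ctx->staging, len < ctx->len ? len : ctx->len);
}

static const lmt_ops_t copy_ops = { copy_put, copy_get };

/* CH3-like driver: after the RTS/CTS handshake it only invokes the
 * mechanism's callbacks, staying oblivious to how the bytes move. */
static void lmt_transfer(const lmt_ops_t *ops, lmt_ctx_t *ctx,
                         const void *src, void *dst, size_t len)
{
    ops->put(ctx, src, len);    /* sender side participates...   */
    ops->get(ctx, dst, len);    /* ...and so does the receiver   */
}
```

Swapping in an RDMA- or copy-mechanism-backed ops table changes the transport without touching the rendezvous logic in CH3.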
Optimization: LMT for Intranode Communication
For intranode, LMT copies through a buffer in shared memory
Sender allocates a shared-memory region
– Sends the buffer ID to the receiver in the RTS packet
Receiver attaches to the memory region
Both sender and receiver participate in the transfer
– Use double-buffering
[Diagram: sender and receiver copying through a double-buffered shared-memory region]
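The double-buffering idea can be sketched as follows. This is a single-threaded illustration with invented names; in the real LMT the two memcpy's run concurrently in the sender and receiver processes, each working on the slot the other is not using.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 4   /* deliberately tiny to force several round trips */

static size_t copy_via_slots(const char *src, char *dst, size_t len)
{
    char slots[2][SLOT_SIZE];   /* stands in for the shared-memory region */
    size_t done = 0;
    int cur = 0;
    while (done < len) {
        size_t n = len - done < SLOT_SIZE ? len - done : SLOT_SIZE;
        memcpy(slots[cur], src + done, n);  /* sender fills one slot...    */
        memcpy(dst + done, slots[cur], n);  /* ...receiver drains it while
                                               the sender fills the other  */
        done += n;
        cur ^= 1;                           /* alternate between the slots */
    }
    return done;
}
```

With both sides copying, the transfer approaches the throughput of two overlapped memcpy's instead of one sequential copy through the queue.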
Current Work In Progress
Optimizations
Multi-threaded environments
Process manager
Other work
Atomic operations library
Current Optimization Work
Handle the common case fast: eager contiguous messages
– Identify this case early in the operation
– Call the netmod's send_eager_contig() function directly
Bypass the receive queue
– Currently: check the unexpected queue, post on the posted queue, check the network
– Optimized: check the unexpected queue, check the network
• Reduced instruction count by 48%
Eliminate function calls
– Collapse layers where possible
Merge Nemesis with CH3
– Move Nemesis functionality into CH3
– CH3 shared-memory support
– New CH3 channel/netmod interface
Cache-aware placement of fields in structures
Fine Grained Threading
MPICH2 supports multi-threaded applications
– MPI_THREAD_MULTIPLE
Currently, thread safety is implemented with a single lock
– The lock is acquired on entering an MPI function
– And released on exit
– Also released when making blocking communication system calls
This limits concurrency in communication
– Only one thread can be in the progress engine at a time
New architectures have multiple DMA engines for communication
– These can work independently of each other
Concurrency is needed in the progress engine for maximum performance
Even without independent network hardware
– Internal concurrency can improve performance
Multicore-Aware Collectives
Intranode communication is much faster than internode
Take advantage of this in collective algorithms
E.g., broadcast
– Send to one process per node; that process broadcasts to the other processes on its node
A step further: collectives over shared memory
– E.g., broadcast
• Within a node, one process writes the data to a shared-memory region
• The other processes read the data
– Issues
• Memory traffic, cache misses, etc.
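The two-level broadcast above can be sketched as a toy simulation, with the "cluster" modeled as an array indexed by [node][local rank]. This is purely illustrative; real implementations send messages between processes rather than copying array slots.

```c
#include <assert.h>

#define NNODES 3
#define PPN 4   /* processes per node */

static void hierarchical_bcast(int data[NNODES][PPN],
                               int root_node, int root_rank)
{
    int v = data[root_node][root_rank];
    /* Phase 1: internode — the root sends to one leader (local rank 0)
     * per node, so only NNODES messages cross the network. */
    for (int n = 0; n < NNODES; n++)
        data[n][0] = v;
    /* Phase 2: intranode — each leader broadcasts to its node's other
     * processes (e.g., over shared memory); nodes proceed in parallel. */
    for (int n = 0; n < NNODES; n++)
        for (int p = 1; p < PPN; p++)
            data[n][p] = data[n][0];
}
```

The payoff is that the slow internode phase scales with the number of nodes, not the number of processes.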
Process Manager
Enhanced support for third-party process managers
– PBS, Slurm
– Working on others
Replacement for existing process managers
– Scalable to 10,000s of nodes and beyond
– Fault-tolerant
– Topology-aware
Other Work
Heterogeneous data representations
– Different architectures use different data representations
• E.g., big/little-endian, 32/64-bit, IEEE/non-IEEE floats, etc.
– Important for heterogeneous clusters and grids
– Use existing datatype manipulation utilities
Fault-tolerance support
– CIFTS – fault-tolerance backplane
– Fault detection and reporting
Atomic Operations Library
Lock-free algorithms use atomic assembly instructions
Assembly instructions are non-portable
– Must be ported for each architecture and compiler
We're working on an atomic operations library
– Implementations for various architectures and compilers
– Stand-alone library
– Not all atomic operations are natively supported on all architectures
• E.g., some have LL/SC but no SWAP
– Such operations can be emulated using the provided operations
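As an example of that emulation, here is a sketch of building fetch-and-add out of compare-and-swap, the way a portable atomics library can fill in a missing primitive. The function name is illustrative (not the library's API), and the GCC/Clang `__sync` builtins stand in for per-architecture assembly; on an LL/SC machine, the CAS itself would be built from LL/SC.

```c
#include <assert.h>

static int emulated_fetch_and_add(volatile int *p, int incr)
{
    int old;
    do {
        old = *p;
        /* Retry if another thread changed *p between the read and
         * the compare-and-swap. */
    } while (!__sync_bool_compare_and_swap(p, old, old + incr));
    return old;    /* value before the add, as fetch-and-add requires */
}
```

The same CAS-retry loop pattern yields SWAP, fetch-and-or, and the other missing operations.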
Tools Included in MPICH2
MPE library for tracing MPI and other calls
Scalable log file format (slog2)
Jumpshot tool for visualizing log files
– Supports threads
Collchk library for checking that the application calls collective operations correctly
For more information…
MPICH2 website
– http://www.mcs.anl.gov/research/projects/mpich2
SVN repository
– svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk mpich2
Developer pages
– http://wiki.mcs.anl.gov/mpich2/index.php/Developer_Documentation
Mailing lists
Me
– http://www.mcs.anl.gov/~buntinas