August 22, 2005
Datacenter Fabric Workshop
Open MPI: Overview and Current Status
Tim Woodall - LANL
Galen Shipman - LANL/UNM
Page 2
Overview
• Point-to-Point Architecture
• OpenIB
– Implementation
– Results
• Future Work
Page 3
Point-to-Point Architecture
• Component Architecture:
– “Plug-ins” for different capabilities (e.g. different networks)
– Tunable run-time parameters
• Three component frameworks:
– Point-to-point messaging layer (PML) implements MPI semantics
– Byte Transfer Layer (BTL) abstracts network interfaces
– Memory Pool (mpool) provides for memory management/registration
Page 4
PML Framework
• Single PML manages multiple BTL modules
– Maintains set of BTLs on a per-peer basis
– Message fragmentation and scheduling
• Implements MPI semantics
– Synchronous / buffered / ready / standard sends
– Persistent requests / Request completion
• Eager/Rendezvous protocol
– Eager send of short messages
– Configurable threshold (short vs. long)
– Multiple long protocols
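The eager/rendezvous split above can be sketched as a simple threshold test. This is a minimal illustration, not the PML's actual code: the function name `choose_protocol`, the parameter name `eager_limit`, and the choice to treat a message exactly at the threshold as short are all assumptions.

```python
def choose_protocol(length: int, eager_limit: int) -> str:
    """Pick a send protocol for a message of `length` bytes.

    Hypothetical helper: messages at or below `eager_limit` are sent
    eagerly; longer ones wait for the receiver to match (rendezvous).
    The inclusive boundary is an assumption of this sketch.
    """
    if length < 0 or eager_limit < 0:
        raise ValueError("length and eager_limit must be non-negative")
    return "eager" if length <= eager_limit else "rendezvous"


if __name__ == "__main__":
    # Example threshold chosen for illustration only.
    for size in (64, 12 * 1024, 1024 * 1024):
        print(size, choose_protocol(size, eager_limit=12 * 1024))
```

Raising the threshold trades receiver-side buffering for fewer handshakes on mid-sized messages, which is why it is exposed as a tunable.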
Page 5
PML Protocols
• Send / receive pipeline to / from pre-registered buffers (non-contiguous data)
• MPI_Alloc_mem support
– Red/black tree of memory registrations
– BTL associated with registration is used by scheduler
– Xfer of contiguous data with 1 RDMA (after match)
• “Leave pinned” run-time parameter
– Registration on first-use
– MRU cache (configurable size) of registrations
– Bandwidth equivalent to pre-registered buffers (MPI_Alloc_mem)
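The "leave pinned" idea (register a buffer on first use, then keep the registration in a bounded cache) can be sketched as follows. This is an illustration under stated assumptions: the class `RegistrationCache`, its `register`/`deregister` callbacks, and the policy of evicting the least recently used entry when the cache is full are hypothetical, not taken from the Open MPI source.

```python
from collections import OrderedDict


class RegistrationCache:
    """Bounded cache of memory registrations, keyed by (address, length).

    Hypothetical sketch: `register` and `deregister` stand in for the
    network-specific pin/unpin calls, which are not shown in the slides.
    """

    def __init__(self, capacity, register, deregister):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._register = register
        self._deregister = deregister
        self._entries = OrderedDict()  # oldest use first, newest use last

    def lookup(self, addr, length):
        """Return a registration handle, registering on first use."""
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        handle = self._register(addr, length)
        self._entries[key] = handle
        if len(self._entries) > self.capacity:
            # Evict the entry that has gone unused the longest.
            _, old_handle = self._entries.popitem(last=False)
            self._deregister(old_handle)
        return handle
```

A repeated send from the same buffer hits the cache and skips registration, which is consistent with the slide's claim that bandwidth matches pre-registered buffers.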
Page 6
PML Protocols (Continued)
• Dynamic memory registration/deregistration
– Fragment message and build pipeline of RDMA requests
– Overlap [de-]registration with RDMA
– Bandwidth 97% of pre-registered memory at large message sizes (8 Mbytes)
– Performance impacted by bus type/bandwidth
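The pipelined protocol above can be pictured as a three-stage schedule: while one fragment is in flight over RDMA, the next is being registered and the previous one deregistered. The sketch below only models that ordering. The function names, the fixed fragment size, and the one-fragment-per-stage depth are assumptions for illustration.

```python
def fragments(length, frag_size):
    """Yield (offset, size) pairs that cover `length` bytes."""
    if frag_size <= 0:
        raise ValueError("frag_size must be positive")
    offset = 0
    while offset < length:
        size = min(frag_size, length - offset)
        yield offset, size
        offset += size


def pipeline_schedule(length, frag_size):
    """Return a list of steps; each step lists the actions that overlap.

    At step i: register fragment i, RDMA fragment i-1, and
    deregister fragment i-2 (a hypothetical three-stage pipeline).
    """
    frags = list(fragments(length, frag_size))
    if not frags:
        return []
    steps = []
    for step in range(len(frags) + 2):
        actions = []
        if step < len(frags):
            actions.append(("register", frags[step]))
        if 0 <= step - 1 < len(frags):
            actions.append(("rdma", frags[step - 1]))
        if 0 <= step - 2 < len(frags):
            actions.append(("deregister", frags[step - 2]))
        steps.append(actions)
    return steps


if __name__ == "__main__":
    for i, actions in enumerate(pipeline_schedule(10, 4)):
        print(i, actions)
```

With enough fragments, the registration cost is hidden behind the transfers except at the first and last steps, which fits the slide's observation that the gap to pre-registered memory is small at large message sizes.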
Page 7
BTL Framework
• MPI agnostic
• Provides simple API to upper layers
– Tagged send/receive primitives
– One-sided put/get operations
• Access to data type engine for zero copy data transfer
• BTL modules natively support commodity networks:
– Current (self, shared memory, Myrinet GM/MX, InfiniBand mvapi/OpenIB, Portals, TCP)
– Planned (LAPI, Quadrics Elan4)
Page 8
OpenIB BTL
• BTL module initialization
• Resource allocation
• Connection management
• Small message Xfer
• Large message Xfer
• OpenIB Issues
• Future Work
Page 9
BTL module initialization
• A separate BTL module is initialized for each port on each HCA
• The PML schedules across these BTL modules just as it does for any other interconnect
• When multiple BTL modules exist, peers establish QP connections by matching subnets
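Subnet matching can be sketched as grouping a peer's ports by subnet and pairing each local port with the remote ports that share its subnet. This is an illustration only: the function name, the `(port_name, subnet_id)` tuples, and the choice to pair a local port with every remote port on the same subnet are assumptions, since the slides do not say how ties are resolved.

```python
def match_by_subnet(local_ports, remote_ports):
    """Pair local and remote ports that sit on the same subnet.

    Both arguments are lists of (port_name, subnet_id) tuples
    (a hypothetical representation). Returns (local, remote) name pairs.
    """
    remote_by_subnet = {}
    for name, subnet in remote_ports:
        remote_by_subnet.setdefault(subnet, []).append(name)

    pairs = []
    for name, subnet in local_ports:
        # Ports on a subnet the peer cannot reach get no connection.
        for remote_name in remote_by_subnet.get(subnet, []):
            pairs.append((name, remote_name))
    return pairs


if __name__ == "__main__":
    local = [("hca0:1", "subnet-a"), ("hca0:2", "subnet-b")]
    remote = [("hca0:1", "subnet-b"), ("hca1:1", "subnet-c")]
    print(match_by_subnet(local, remote))
```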
Page 10
Resource Allocation
Page 11
SRQ Scalability
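| Nodes | Frag size (Kbytes) | # posted | RQ per QP (Mbytes) | K* | SRQ (Mbytes) |
|---|---|---|---|---|---|
| 128 | 8 | 64 | 64 | 2 | 1 |
| 256 | 8 | 64 | 128 | 4 | 2 |
| 512 | 8 | 64 | 256 | 8 | 4 |
| 1024 | 8 | 64 | 512 | 10 | 5 |

* K: multiplier based on number of nodes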
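The figures on this slide are consistent with a simple memory calculation, sketched below. The formulas are inferred from the numbers rather than stated in the slides: a receive queue per QP is assumed to cost one set of posted fragments per node, while a shared receive queue is assumed to cost K sets in total. The function names are hypothetical.

```python
def rq_per_qp_mbytes(nodes, posted, frag_kbytes):
    """Receive-buffer memory when every peer QP has its own receive queue.

    Assumes one QP per node and `posted` fragments of `frag_kbytes` each.
    """
    return nodes * posted * frag_kbytes / 1024


def srq_mbytes(k, posted, frag_kbytes):
    """Receive-buffer memory with one shared receive queue.

    `k` is the node-count-based multiplier from the slide.
    """
    return k * posted * frag_kbytes / 1024


if __name__ == "__main__":
    # (nodes, K) pairs as read from the slide; 64 fragments of 8 Kbytes.
    for nodes, k in ((128, 2), (256, 4), (512, 8), (1024, 10)):
        print(nodes, rq_per_qp_mbytes(nodes, 64, 8), srq_mbytes(k, 64, 8))
```

Under these assumptions the per-QP cost grows linearly with node count, while the shared queue grows only with the much smaller multiplier K.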
Page 12
Connection management
• Addressing information is exchanged dynamically via an OOB channel
– This greatly improves scalability but at the cost of increased first message latency
– Connections are established with peers in the same subnet (local subnet routing only)
Page 13
Small Message Xfer
– Maintain list of pre-registered fragments for send and recv
– List grows dynamically in chunks (more efficient to register)
– Small messages are copied to/from pre-registered buffers
– Recv descriptors are posted as needed based on min/max thresholds
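One way to read the min/max thresholds is as low and high water marks: when the number of posted receive descriptors drops to the minimum, the queue is refilled to the maximum. The sketch below models only that counting logic. The class `RecvQueue`, the names `rd_min` and `rd_max`, and the water-mark interpretation itself are assumptions.

```python
class RecvQueue:
    """Tracks how many receive descriptors are currently posted.

    Hypothetical sketch: refills to `rd_max` whenever the posted
    count falls to `rd_min` or below.
    """

    def __init__(self, rd_min, rd_max):
        if rd_max < 1 or not 0 <= rd_min <= rd_max:
            raise ValueError("require 0 <= rd_min <= rd_max and rd_max >= 1")
        self.rd_min = rd_min
        self.rd_max = rd_max
        self.posted = 0
        self.repost()  # initial fill

    def repost(self):
        """Post descriptors if at or below the low mark; return how many."""
        if self.posted > self.rd_min:
            return 0
        added = self.rd_max - self.posted
        self.posted = self.rd_max
        return added

    def on_receive(self):
        """A descriptor was consumed by an incoming fragment."""
        self.posted -= 1
        return self.repost()
```

Posting in batches rather than one descriptor per receive keeps the common receive path short, at the cost of briefly running with fewer buffers than the maximum.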
Page 14
Small Message Performance
| Configuration | Average latency (usec) |
|---|---|
| OpenMPI - OpenIB - *optimized | 5.13 |
| OpenMPI - OpenIB - *defaults | 5.43 |
| OpenMPI - Mvapi - *optimized | 5.64 |
| OpenMPI - Mvapi - *defaults | 5.94 |
| Mvapich - Mvapi (rdma/mem poll) | 4.19 |
| Mvapich - Mvapi (send/recv) | 6.51 |

* Send/Recv based protocol
Page 15
Large Message Xfer
• RDMA Write and RDMA Read are both supported
• RDMA Read provides better performance than RDMA Write because it requires fewer control messages
• RDMA pipeline protocol performance is highly dependent on I/O bus performance
Page 16
Results OpenMPI/OpenIB - All
Page 17
Results OpenMPI/OpenIB - All - Log
Page 18
Results OpenMPI/OpenIB - Eager limit
Page 19
Results Combined Results
Page 20
Results Combined Results - Log
Page 21
OpenIB Opportunities
– User level notification of VM activity
  • Caching of memory registrations can be dangerous
  • Need the ability to detect VM changes that affect memory registrations (such as sbrk and munmap)
– Reliable Multicast for collectives
– SRQ performance, 2/10 usec penalty, but who’s counting?
Page 22
Future Work
• Small message RDMA (using working set of peers) - optional
• Dynamic connection management using Unreliable Datagrams
• Dynamic connection teardown - optional
Page 23
Source Code Access
• Subversion repository
• Download client from:
– http://subversion.tigris.org/
– v1.2.1 or later
• Check out with:
– svn co http://svn.open-mpi.org/svn/ompi/trunk ompi
– Anonymous, read-only access
Page 24
Questions?
Tim Woodall
Email: [email protected]
Phone: 505-665-5224
Galen Shipman
Email: [email protected]
Page 25
Hardware Specs
• Dual Intel Xeon 3.2 GHz
– 1024 KB Cache
• 2 Gbytes memory
• Bus: Intel Corp. E7525/E7520/E7320 PCI Express
• Mellanox Technologies MT25208 InfiniHost III Ex
• 288 Port Voltaire switch