August 22, 2005 Page 1 of (#) Datacenter Fabric Workshop Open MPI Overview and Current Status Tim Woodall - LANL Galen Shipman - LANL/UNM

Datacenter Fabric Workshop

Open MPI Overview and Current Status

Tim Woodall - LANL

Galen Shipman - LANL/UNM

Page 2

Overview

• Point-to-Point Architecture

• OpenIB

– Implementation

– Results

• Future Work

Page 3

Point-to-Point Architecture

• Component Architecture:

– “Plug-ins” for different capabilities (e.g. different networks)

– Tunable run-time parameters

• Three component frameworks:

– Point-to-point messaging layer (PML) implements MPI semantics

– Byte Transfer Layer (BTL) abstracts network interfaces

– Memory Pool (mpool) provides for memory management/registration

Page 4

PML Framework

• Single PML manages multiple BTL modules

– Maintains set of BTLs on a per-peer basis

– Message fragmentation and scheduling

• Implements MPI semantics

– Synchronous / buffered / ready / normal sends

– Persistent requests / Request completion

• Eager/Rendezvous protocol

– Eager send of short messages

– Configurable threshold (short vs. long)

– Multiple long protocols
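
The eager/rendezvous split above comes down to a size comparison against the configurable threshold. A minimal sketch, assuming a hypothetical function name and a threshold value chosen only for illustration (the slides do not give the default):

```python
def choose_protocol(message_bytes, eager_limit):
    """Pick a send protocol from the message size.

    Short messages (up to eager_limit) are sent eagerly; longer ones
    go through a rendezvous handshake first. Both names are illustrative
    and are not Open MPI identifiers.
    """
    if message_bytes < 0 or eager_limit < 0:
        raise ValueError("sizes must be non-negative")
    return "eager" if message_bytes <= eager_limit else "rendezvous"


# Hypothetical threshold of 12 KiB, used only for this example.
EXAMPLE_EAGER_LIMIT = 12 * 1024
small = choose_protocol(1024, EXAMPLE_EAGER_LIMIT)
large = choose_protocol(1 << 20, EXAMPLE_EAGER_LIMIT)
```

Raising the threshold trades extra copies and buffer space for fewer handshakes, which is why the slides list it as a tunable.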

Page 5

PML Protocols

• Send / receive pipeline to / from pre-registered buffers (non-contiguous data)

• MPI_Alloc_mem support

– Red/black tree of memory registrations

– BTL associated with registration is used by scheduler

– Xfer of contiguous data with 1 RDMA (after match)

• “Leave pinned” run-time parameter

– Registration on first use

– MRU cache (configurable size) of registrations

– Bandwidth equivalent to pre-registered buffers (MPI_Alloc_mem)
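
The "leave pinned" behaviour can be pictured as a small cache of registrations keyed by buffer. The sketch below is an assumption-laden illustration, not the Open MPI code: the class name, the `register`/`deregister` callbacks, and the choice to evict the least recently used entry once the configurable size is exceeded are all hypothetical.

```python
from collections import OrderedDict


class RegistrationCache:
    """Keep the most recently used registrations, up to `capacity` entries."""

    def __init__(self, capacity, register, deregister):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._register = register      # hypothetical callback: (addr, length) -> handle
        self._deregister = deregister  # hypothetical callback: handle -> None
        self._entries = OrderedDict()  # oldest use first, newest use last
        self.hits = 0
        self.misses = 0

    def lookup(self, addr, length):
        """Return a registration handle, registering on first use."""
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        handle = self._register(addr, length)
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            # Assumed policy: drop the entry that was used longest ago.
            _, evicted = self._entries.popitem(last=False)
            self._deregister(evicted)
        return handle
```

A buffer that is reused for every send hits the cache after its first use, which is consistent with the slide's claim that bandwidth then matches pre-registered memory.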

Page 6

PML Protocols (Continued)

• Dynamic memory registration/deregistration

– Fragment message and build pipeline of RDMA requests

– Overlap [de-]registration with RDMA

– Bandwidth 97% of pre-registered memory at large message sizes (8 Mbytes)

– Performance impacted by bus type/bandwidth
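
To see why overlapping registration with RDMA helps, the fragmenting step and a simple two-stage cost model can be sketched as follows. The function names and the constant per-fragment costs are assumptions for illustration; deregistration is folded into the registration cost, and nothing here models real hardware.

```python
def fragments(total_bytes, frag_bytes):
    """Yield (offset, length) pairs that cover the message in order."""
    if frag_bytes <= 0:
        raise ValueError("frag_bytes must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(frag_bytes, total_bytes - offset)
        yield offset, length
        offset += length


def serial_time(n_frags, t_reg, t_rdma):
    """Register, then transfer, one fragment at a time."""
    return n_frags * (t_reg + t_rdma)


def pipelined_time(n_frags, t_reg, t_rdma):
    """Register fragment i+1 while fragment i is being transferred.

    Only the first registration and the last transfer are fully exposed;
    in between, the slower of the two stages sets the pace.
    """
    if n_frags == 0:
        return 0
    return t_reg + (n_frags - 1) * max(t_reg, t_rdma) + t_rdma
```

With four fragments, a registration cost of 1 and a transfer cost of 3 (arbitrary units), the serial schedule takes 16 units and the pipelined one 13; the gap to the ideal of 12 shrinks as the message grows, which matches the shape of the 97% figure quoted for large messages.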

Page 7

BTL Framework

• MPI agnostic

• Provides simple API to upper layers

– Tagged send/receive primitives

– One-sided put/get operations

• Access to data type engine for zero copy data transfer

• BTL modules natively support commodity networks:

– Current (self, shared memory, Myrinet GM/MX, InfiniBand mvapi/OpenIB, Portals, TCP)

– Planned (LAPI, Quadrics Elan4)

Page 8

OpenIB BTL

• BTL module initialization

• Resource allocation

• Connection management

• Small message Xfer

• Large message Xfer

• OpenIB Issues

• Future Work

Page 9

BTL module initialization

• A separate BTL module is initialized for each port on each HCA

• The PML schedules across these BTL modules just as it does for any other interconnect

• When multiple BTL modules exist, peers establish QP connections by matching subnets
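
Subnet matching can be illustrated with a small pairing routine. This is a sketch under assumptions: the port names and subnet labels are made up, and pairing every local port with every peer port on the same subnet is a simplification, since the slides do not describe the actual pairing policy.

```python
def match_ports(local_ports, peer_ports):
    """Pair local and peer ports that share a subnet.

    Each port is a (name, subnet_id) tuple. Ports on subnets the other
    side cannot reach produce no pair.
    """
    peers_by_subnet = {}
    for name, subnet in peer_ports:
        peers_by_subnet.setdefault(subnet, []).append(name)

    pairs = []
    for name, subnet in local_ports:
        for peer_name in peers_by_subnet.get(subnet, []):
            pairs.append((name, peer_name))
    return pairs


# Hypothetical two-port node talking to a peer that shares only subnet B.
example_pairs = match_ports(
    [("hca0:1", "subnetA"), ("hca0:2", "subnetB")],
    [("hca0:1", "subnetB"), ("hca1:1", "subnetC")],
)
```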

Page 10

Resource Allocation

Page 11

SRQ Scalability

Nodes   Frag size (Kbytes)   # posted   RQ per QP (Mbytes)   K    SRQ (Mbytes)
128     8                    64         64                   2    1
256     8                    64         128                  4    2
512     8                    64         256                  8    4
1024    8                    64         512                  10   5

K - multiplier based on number of nodes
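
The memory figures on this slide are consistent with a simple calculation using 64 posted fragments of 8 Kbytes each. The formulas below are an assumption inferred from those numbers, not something the slides state, and the function names are hypothetical.

```python
KBYTES_PER_MBYTE = 1024


def per_qp_rq_mbytes(nodes, posted, frag_kbytes):
    """Receive-buffer memory when every peer connection has its own receive queue."""
    return nodes * posted * frag_kbytes / KBYTES_PER_MBYTE


def srq_mbytes(k, posted, frag_kbytes):
    """Receive-buffer memory with one shared receive queue scaled by the multiplier K."""
    return k * posted * frag_kbytes / KBYTES_PER_MBYTE


# 128 nodes: 64 Mbytes with per-QP queues versus 1 Mbyte with an SRQ (K = 2).
rq_128 = per_qp_rq_mbytes(128, 64, 8)
srq_128 = srq_mbytes(2, 64, 8)
```

Per-QP memory grows linearly with the node count, while the SRQ figure grows only with K, which is the scalability argument the slide is making.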

Page 12

Connection management

• Addressing information is exchanged dynamically via an out-of-band (OOB) channel

– This greatly improves scalability but at the cost of increased first-message latency

– Connections are established with peers in the same subnet (local subnet routing only)

Page 13

Small Message Xfer

– Maintain list of pre-registered fragments for send and recv

– List grows dynamically in chunks (more efficient to register)

– Small messages are copied to/from pre-registered buffers

– Recv descriptors are posted as needed based on min/max thresholds
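
Posting receive descriptors "as needed based on min/max thresholds" can be read as a low/high watermark scheme. The sketch below assumes that reading; the class and attribute names are hypothetical and the counts stand in for real descriptors.

```python
class RecvQueue:
    """Track posted receive descriptors and top up between two thresholds."""

    def __init__(self, low_water, high_water):
        if not 0 < low_water <= high_water:
            raise ValueError("need 0 < low_water <= high_water")
        self.low_water = low_water
        self.high_water = high_water
        self.posted = 0
        self.replenish()  # initial fill up to the high threshold

    def replenish(self):
        """Post up to high_water once the count drops below low_water.

        Returns the number of descriptors posted by this call.
        """
        if self.posted >= self.low_water:
            return 0
        added = self.high_water - self.posted
        self.posted = self.high_water
        return added

    def on_receive(self):
        """Consume one descriptor, then top up if the queue is running low."""
        if self.posted == 0:
            raise RuntimeError("no receive descriptor posted")
        self.posted -= 1
        return self.replenish()
```

Topping up in batches rather than after every receive keeps the number of post operations low, at the cost of holding more buffers than the minimum.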

Page 14

Small Message Performance

Average Latency

OpenMPI - OpenIB - *optimized 5.13 usec

OpenMPI - OpenIB - *defaults 5.43 usec

OpenMPI - Mvapi - *optimized 5.64 usec

OpenMPI - Mvapi - *defaults 5.94 usec

Mvapich - Mvapi (rdma/mem poll) 4.19 usec

Mvapich - Mvapi (send/recv) 6.51 usec

* Send/Recv based protocol

Page 15

Large Message Xfer

• RDMA Write and RDMA Read are both supported

• RDMA Read provides better performance than RDMA Write because fewer control messages are needed

• RDMA pipeline protocol performance is highly dependent on I/O bus performance
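
One way to picture the control-message saving is to list the handshake steps for each approach. The message names and sequences below are an illustrative model of a generic rendezvous protocol, assumed for this sketch; they are not taken from the slides or from the Open MPI source.

```python
def control_messages(protocol):
    """Return the control messages exchanged around one large transfer."""
    flows = {
        # The sender must learn the receive buffer address before it can write.
        "rdma_write": [
            "sender -> receiver: rendezvous request",
            "receiver -> sender: ack carrying receive buffer address",
            "sender -> receiver: completion notice after RDMA Write",
        ],
        # The request already carries the send buffer address, so the
        # receiver can start reading without a reply in between.
        "rdma_read": [
            "sender -> receiver: rendezvous request carrying send buffer address",
            "receiver -> sender: completion notice after RDMA Read",
        ],
    }
    return flows[protocol]


write_count = len(control_messages("rdma_write"))
read_count = len(control_messages("rdma_read"))
```

In this model the read-based flow saves one control message, and with it one network traversal before the data starts to move.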

Page 16

Results OpenMPI/OpenIB - All

Page 17

Results OpenMPI/OpenIB - All - Log

Page 18

Results OpenMPI/OpenIB - Eager limit

Page 19

Results Combined Results

Page 20

Results Combined Results - Log

Page 21

OpenIB Opportunities

– User-level notification of VM activity

• Caching of memory registrations can be dangerous

• Need the ability to detect VM changes that affect memory registrations (such as sbrk and munmap)

– Reliable Multicast for collectives

– SRQ performance, 2/10 usec penalty, but who’s counting?
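
The danger with cached registrations is that a cached entry can outlive the memory it describes. If a notification of VM changes were available, a library could drop the affected entries, roughly as sketched here. The hook is hypothetical (the slide lists it as a missing capability), and the cache is modelled as a plain dictionary keyed by (address, length).

```python
def overlaps(a_start, a_len, b_start, b_len):
    """True when two half-open address ranges share at least one byte."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def invalidate(cache, freed_addr, freed_len):
    """Remove cached registrations that overlap a range returned to the OS.

    Returns the keys that were dropped so the caller can deregister them.
    """
    stale = [key for key in cache if overlaps(key[0], key[1], freed_addr, freed_len)]
    for key in stale:
        del cache[key]
    return stale
```

Without such a hook, a buffer that is unmapped and later remapped at the same address could match a stale cache entry that still refers to the old physical pages.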

Page 22

Future Work

• Small message RDMA (using working set of peers) - optional

• Dynamic connection management using Unreliable Datagrams

• Dynamic connection teardown - optional

Page 23

Source Code Access

• Subversion repository

• Download client from:

– http://subversion.tigris.org/

– v1.2.1 or later

• Check out with:

– svn co http://svn.open-mpi.org/svn/ompi/trunk ompi

– Anonymous, read-only access

Page 24

Questions?

Tim Woodall

Email: [email protected]

Phone: 505-665-5224

Galen Shipman

Email: [email protected]

Page 25

Hardware Specs

• Dual Intel Xeon 3.2 GHz

– 1024 KB Cache

• 2 Gbytes memory

• Bus: Intel Corp. E7525/E7520/E7320 PCI Express

• Mellanox Technologies MT25208 InfiniHost III Ex

• 288 Port Voltaire switch