1 Managed by UT-Battelle for the Department of Energy Graham_OpenMPI_SC08 Graham_OpenMPI_SC08
The Role of InfiniBand Technologies in
High Performance Computing
Contributors
Gil Bloch
Noam Bloch
Hillel Chapman
Manjunath Gorentla-Venkata
Richard Graham
Michael Kagan
Josh Ladd
Vasily Philipov
Steve Poole
Ishai Rabinovich
Ariel Shahar
Gilad Shainer
Pavel Shamis
Outline
Spider file system
CORE-Direct
– InfiniBand overview
– New InfiniBand capabilities
– Software design for collective operations
– Results
Spider File System at the Oak Ridge
Leadership Computing Facility
Motivation for Spider File System
Building dedicated file systems for each platform does not scale operationally
– Storage often 10% or more of new system cost
– Bundled storage often not poised to grow independently of attached machine
– Different curves for storage and compute technology
– Data needs to be moved between different compute islands
For example: Simulation platform to visualization platform
– Dedicated storage is only accessible when its machine is available
– Managing multiple file systems requires more manpower
[Figure: data sharing path between Jaguar XT4, Jaguar XT5, Ewok, Lens, and Smoky, before and after the SION network and Spider system]
Spider: A System At Scale
Over 10.7 PB of RAID 6 Capacity
13,440 1TB drives
192 storage servers
Over 3 TB of memory (Lustre OSS)
Available to many compute systems through high-speed network:
– Over 3,000 IB ports
– Over 5 kilometers of cable
Over 26,000 client mounts for I/O
Demonstrated I/O performance: 240 GB/s
Current Status
– in production use on all major OLCF computing platforms
Spider: Couplet and Scalable Cluster
[Figure: a Scalable Cluster (SC). The diagram repeats the same block three times: 280 1TB disks in 5 disk trays, a DDN couplet (2 controllers), OSS (4 Dell nodes), 24 IB ports, and a Flextronics switch with an uplink to the Cisco core switch. A 4 x 4 grid of SC boxes shows the floor layout.]
16 SC Units on the floor
2 racks for each SC
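The per-couplet figures above can be cross-checked against the system totals quoted on the previous slide. The following is a minimal sketch, not an official sizing: it assumes three couplets per SC (the diagram repeats the couplet block three times), an 8+2 RAID 6 layout, and decimal units (1 PB = 1000 TB). None of these three assumptions is stated on the slides.

```python
# Figures stated on the slides.
SC_UNITS = 16            # Scalable Cluster units on the floor
DISKS_PER_COUPLET = 280  # 1 TB disks behind each DDN couplet
OSS_PER_COUPLET = 4      # Dell OSS nodes per couplet
OST_PER_OSS = 7
DISK_TB = 1.0

# Assumptions, not stated on the slides.
COUPLETS_PER_SC = 3      # the SC diagram repeats the couplet block three times
RAID6_DATA = 8           # assumed 8 data + 2 parity disks per RAID 6 group
RAID6_PARITY = 2


def spider_totals():
    """Derive system-wide totals from the per-couplet figures."""
    couplets = SC_UNITS * COUPLETS_PER_SC
    disks = couplets * DISKS_PER_COUPLET
    oss = couplets * OSS_PER_COUPLET
    osts = oss * OST_PER_OSS
    raw_pb = disks * DISK_TB / 1000.0  # decimal units: 1 PB = 1000 TB
    usable_pb = raw_pb * RAID6_DATA / (RAID6_DATA + RAID6_PARITY)
    return {
        "disks": disks,
        "oss": oss,
        "osts": osts,
        "disks_per_ost": disks // osts,
        "raw_pb": raw_pb,
        "usable_pb": usable_pb,
    }


if __name__ == "__main__":
    for name, value in spider_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the sketch reproduces the 13,440 drives and 192 storage servers quoted for Spider, and a usable capacity slightly above 10.7 PB.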
Snapshot of Technical Challenges
Solved
Performance
– Asynchronous journaling
– Network congestion avoidance (topology-aware I/O)
Scalability
– 26,000 clients
– 7 OSTs per OSS
– Lessons from server-side client statistics
Fault Tolerance and Reliability
– Network, I/O server, Storage Array
[Figure: SeaStar torus congestion plot; axis labels and values were not recoverable]
Spider - How Did We Get Here?
4-year project
We didn’t just pick up the phone and order a center-wide file system
– No single vendor could deliver this system
– Trailblazing was required
Collaborative effort was key to success
– ORNL
– Cray
– DDN
– Cisco
– CFS, SUN, Oracle, and now Whamcloud
CORE-Direct Technology
Problems Being Addressed – Collective
Operations
Collective communication characteristics at scale
– Overlapping computation with communication – true asynchronous communications
– System noise
– Performance
– Scalability
Goal: Avoid using the CPU for communication processing
Offload communication management to the network
Collective Communications
Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)
Optimized collectives involve a communicator-wide data-dependent communication pattern
Data needs to be manipulated at intermediate stages of a collective operation
Collective operations limit application scalability
Collective operations magnify the effects of system noise
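One way to see why collectives magnify system noise: a collective finishes only when its slowest participant does, so a delay on any one process delays everyone. The sketch below is a simplified model, not taken from the slides. It assumes each process is independently interrupted with probability `p` during one collective call; both `p` and the independence assumption are hypothetical.

```python
def prob_collective_delayed(p, nprocs):
    """Probability that at least one of nprocs processes is interrupted.

    Assumes independent interruptions with per-process probability p
    during a single collective call (a modeling assumption).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    return 1.0 - (1.0 - p) ** nprocs


if __name__ == "__main__":
    for n in (1, 100, 10_000, 100_000):
        print(n, prob_collective_delayed(0.001, n))
```

In this model, an interruption that is rare on a single process becomes close to certain for the collective as the process count grows.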
Scalability of Collective Operations
[Figure: two panels, "Ideal Algorithm" and "Impact of System Noise", with numbered markers 1 to 4]
Scalability of Collective Operations - II
[Figure: two panels, "Offloaded Algorithm" and "Nonblocking Algorithm"; legend: communication processing]
Approach to solving the problem
Co-design
– Network stack design (Mellanox)
– Hardware development (Mellanox)
– Application level requirement (ORNL)
– MPI/Shmem level implementation (Joint)
InfiniBand Collective Offload – Key idea
Create local description of the communication patterns
Hand the description to the HCA
Manage collective communications at the network level
Poll for collective completion
Add new support for
– Synchronization primitives (hardware): Send Enable task, Receive Enable task, Wait task
– Multiple Work Request: a sequence of network tasks
– Management Queue
InfiniBand Hardware Changes
Tasks defined in the current standard
• Send
• Receive
• Read
• Write
• Atomic
New support
Synchronization primitives (hardware)
– Send Enable task
– Receive Enable task
– Wait task
Multiple Work Request
– A sequence of network tasks
Management Queue
Standard InfiniBand Connected Queue
Design
Queue Structure
[Figure: queue structure. Per-peer resources: four QPs, each with send and receive queues and a receive CQ, labeled small data, large data, credit QP, and resource recycling. Per-communicator resources: collective MQ, MQ CQ, service MQ, and a send CQ for all send queues.]
Collectives – Software Layers
[Figure: software layers in the OMPI Module Component Architecture. Labels: Collective Framework; Tuned (pt2pt) collectives component; ML hierarchical collectives component; Basic Collectives Framework (IB offload, Pt2Pt, SM); Subgroup Framework (Socket, IBNET, Shared Memory); MLNX OFED]
Example – 4 Process Recursive Doubling
[Figure: recursive doubling among processes 1 to 4 in two steps (Step 1, Step 2)]
4 Process Barrier Example
Algorithm

| Proc 0 | Proc 1 | Proc 2 | Proc 3 |
|---|---|---|---|
| Exchange with proc 1 | Exchange with proc 0 | Exchange with proc 3 | Exchange with proc 2 |
| Exchange with proc 2 | Exchange with proc 3 | Exchange with proc 0 | Exchange with proc 1 |

MWR

| Proc 0 | Proc 1 | Proc 2 | Proc 3 |
|---|---|---|---|
| Send to proc 1 | Send to proc 0 | Send to proc 3 | Send to proc 2 |
| Wait on recv from 1 | Wait on recv from 0 | Wait on recv from 3 | Wait on recv from 2 |
| Send to proc 2 | Send to proc 3 | Send to proc 0 | Send to proc 1 |
| Wait on recv from 2 | Wait on recv from 3 | Wait on recv from 0 | Wait on recv from 1 |
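The send/wait sequence in this example follows the usual recursive-doubling rule: at each step a process pairs with the rank that differs from its own in one bit. The sketch below generates that task list and runs a toy simulation to check that every rank can finish. It is an illustration under stated assumptions, not the HCA interface: the names `barrier_tasks` and `simulate_barrier` are hypothetical, the process count must be a power of two, and messages are treated as delivered instantly.

```python
def barrier_tasks(rank, nprocs):
    """Return the ordered (kind, partner) tasks for one rank.

    kind is "send" or "wait"; nprocs must be a power of two.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    tasks = []
    distance = 1
    while distance < nprocs:
        partner = rank ^ distance  # flip one bit of the rank per step
        tasks.append(("send", partner))
        tasks.append(("wait", partner))
        distance *= 2
    return tasks


def simulate_barrier(nprocs):
    """Run all task lists to completion; return True if no rank gets stuck."""
    task_lists = [barrier_tasks(rank, nprocs) for rank in range(nprocs)]
    next_task = [0] * nprocs
    delivered = set()  # (source, destination) pairs already sent
    progressed = True
    while progressed:
        progressed = False
        for rank in range(nprocs):
            if next_task[rank] == len(task_lists[rank]):
                continue  # this rank has finished its list
            kind, partner = task_lists[rank][next_task[rank]]
            if kind == "send":
                delivered.add((rank, partner))
            elif (partner, rank) not in delivered:
                continue  # wait task is still blocked
            next_task[rank] += 1
            progressed = True
    return all(next_task[r] == len(task_lists[r]) for r in range(nprocs))


if __name__ == "__main__":
    for rank in range(4):
        print(rank, barrier_tasks(rank, 4))
    print("4-process barrier completes:", simulate_barrier(4))
```

For four processes, `barrier_tasks` yields the same partner order as the MWR rows in the example. For eight processes, rank 0 pairs with ranks 1, 2, and 4.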
4 Process Barrier Example – Queue view
MQ

| Proc 0 | Proc 1 | Proc 2 | Proc 3 |
|---|---|---|---|
| Recv wait from 1 | Recv wait from 0 | Recv wait from 3 | Recv wait from 2 |
| Send enable 1 | Send enable 0 | Send enable 3 | Send enable 2 |
| Recv wait from 2 | Recv wait from 3 | Recv wait from 0 | Recv wait from 1 |

Send QP

| Proc 0 | Proc 1 | Proc 2 | Proc 3 |
|---|---|---|---|
| Send to proc 1 (enabled) | Send to proc 0 (enabled) | Send to proc 3 (enabled) | Send to proc 2 (enabled) |
| Send to 2 (not enabled) | Send to 3 (not enabled) | Send to 0 (not enabled) | Send to 1 (not enabled) |

Completion
8 Process Barrier Example – Queue view
– no MQ, View at rank 0
| QP 1 | QP 2 | QP 4 |
|---|---|---|
| Send QP 1 | Wait QP 1 | Wait QP 1 |
| | Send QP 2 | Wait QP 2 |
| | | Send QP 4 |
| | | Wait QP 4 |
System Hierarchy
[Figure: system hierarchy with levels labeled System, Network, Node, and Socket; legend: occupied core, unused core]
Benchmarks
System setup
8 node cluster
Node Architecture
– 3 GHz Intel Xeon
– Dual socket
– Quad core
Network
– ConnectX-2 HCA
– 36 port QDR switch running pre-release firmware
Barrier Data
8 Node Blocking MPI Barrier
MPI Barrier - Offloaded
MPI Barrier – Comparison with PtP
MPIX_Ibarrier Performance
Nonblocking Barrier – Overlap –
Multiple Work Quanta
Nonblocking Barrier – Overlap –
1 Work Quantum
Barrier Data
Hierarchy
Flat Barrier Algorithm
[Figure: flat barrier among processes 1 to 4 on two hosts (Host 1, Host 2), showing Step 1, Step 2, and the inter-host communication]
Hierarchical Barrier Algorithm
[Figure: hierarchical barrier among processes 1 to 4 on two hosts (Host 1, Host 2), showing Step 1, Step 2, Step 3, and the inter-host communication]
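A common motivation for a hierarchical barrier is to reduce traffic between hosts. The sketch below counts inter-host messages under one plausible reading of the two algorithms; it is not derived from the slide diagrams. The flat variant is assumed to be recursive doubling over all ranks. The hierarchical variant is assumed to synchronize within each host first, run recursive doubling among one leader per host, and then release the local ranks. Ranks are assumed to be packed onto hosts in order, and all counts are assumed to be powers of two.

```python
def _is_power_of_two(n):
    return n >= 1 and n & (n - 1) == 0


def flat_inter_host_messages(nprocs, procs_per_host):
    """Count messages that cross hosts in a flat recursive-doubling barrier."""
    if not (_is_power_of_two(nprocs) and _is_power_of_two(procs_per_host)):
        raise ValueError("counts must be powers of two")
    if procs_per_host > nprocs:
        raise ValueError("procs_per_host cannot exceed nprocs")
    count = 0
    distance = 1
    while distance < nprocs:
        for rank in range(nprocs):
            partner = rank ^ distance
            # Ranks are packed in order, so the host index is rank // procs_per_host.
            if rank // procs_per_host != partner // procs_per_host:
                count += 1
        distance *= 2
    return count


def hierarchical_inter_host_messages(nprocs, procs_per_host):
    """Count inter-host messages when only one leader per host communicates."""
    if not (_is_power_of_two(nprocs) and _is_power_of_two(procs_per_host)):
        raise ValueError("counts must be powers of two")
    if procs_per_host > nprocs:
        raise ValueError("procs_per_host cannot exceed nprocs")
    nhosts = nprocs // procs_per_host
    count = 0
    distance = 1
    while distance < nhosts:
        count += nhosts  # each leader sends one message per step
        distance *= 2
    return count


if __name__ == "__main__":
    for nprocs, per_host in ((4, 2), (64, 8)):
        print(nprocs, per_host,
              flat_inter_host_messages(nprocs, per_host),
              hierarchical_inter_host_messages(nprocs, per_host))
```

Under these assumptions the hierarchical variant sends fewer messages across the network, at the cost of an extra intra-host step.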
MPI Barrier timings
Barrier timings – blocking vs.
nonblocking
Nonblocking Barrier Overlap
Broadcast Data
IB – Large Message Algorithm
[Figure: Process I and Process J, each with a QP (send queue: Send, Send, Wait; receive queue: Recv, Recv) and a credit QP (receive queue: Recv, Recv; send queue: Send, Send)]
1) Register receive memory
2) Notify sender
3) Wait on credit message
4) Send user data
Broadcast Latency – usec per call
| Msg size | IBOff + SM | IBOff | P2P + SM | Open MPI – default | MVAPICH |
|---|---|---|---|---|---|
| 16B | 3.48 | 16.11 | 2.55 | 5.58 | 5.81 |
| 1KB | 4.87 | 23.96 | 5.66 | 12.20 | 10.46 |
| 8MB | 25244 | 40735 | 28288 | 37343 | 41439 |
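The latencies above can be compared directly. The sketch below copies the values from the table and reports, for each message size, the fastest configuration and how many times slower the Open MPI default is than IBOff + SM. The helper names `fastest` and `ratio` are hypothetical, and the sketch draws no conclusions beyond the numbers shown.

```python
# Broadcast latency in microseconds per call, copied from the table.
LATENCY_USEC = {
    "16B": {"IBOff + SM": 3.48, "IBOff": 16.11, "P2P + SM": 2.55,
            "Open MPI default": 5.58, "MVAPICH": 5.81},
    "1KB": {"IBOff + SM": 4.87, "IBOff": 23.96, "P2P + SM": 5.66,
            "Open MPI default": 12.20, "MVAPICH": 10.46},
    "8MB": {"IBOff + SM": 25244, "IBOff": 40735, "P2P + SM": 28288,
            "Open MPI default": 37343, "MVAPICH": 41439},
}


def fastest(size):
    """Name of the configuration with the lowest latency at this size."""
    row = LATENCY_USEC[size]
    return min(row, key=row.get)


def ratio(size, baseline, candidate):
    """How many times slower the baseline is than the candidate."""
    row = LATENCY_USEC[size]
    return row[baseline] / row[candidate]


if __name__ == "__main__":
    for size in LATENCY_USEC:
        r = ratio(size, "Open MPI default", "IBOff + SM")
        print(f"{size}: fastest = {fastest(size)}, "
              f"Open MPI default / IBOff + SM = {r:.2f}")
```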
Nonblocking Broadcast Latency – usec
per call
| Msg size | IBOff + SM | IBOff | P2P + SM |
|---|---|---|---|
| 16B | 3.58 | 19.79 | 2.57 |
| 1KB | 4.96 | 27.44 | 5.70 |
| 8MB | 26100 | 37855 | 28781 |
Broadcast – small data - hierarchical
Broadcast – large data - hierarchical
Overlap Measurement
Benchmark steps:
Polling Method
1. Post broadcast
2. Do work and poll for completion
3. Continue until broadcast completion
Post-work-wait Method
1. Post broadcast
2. Do work
3. Wait for broadcast completion
4. Compare the time of steps 1 – 3 with post-wait
5. Increase the work and repeat steps 1-4 until the time for post-work-wait is greater than post-wait
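The post-work-wait search loop can be sketched independently of any MPI library. In the sketch below, `time_post_work_wait` is a hypothetical callable that runs steps 1 to 3 for a given amount of work and returns the elapsed time, and `post_wait_time` is the measured time with no work inserted. The doubling schedule and the `max_work` cap are assumptions; the slides do not say how the work is increased.

```python
def max_overlappable_work(post_wait_time, time_post_work_wait,
                          start_work=1, growth=2, max_work=1 << 30):
    """Return the largest work size tried whose post-work-wait time
    did not exceed post_wait_time, or 0 if even start_work exceeded it.
    """
    if start_work < 1 or growth < 2:
        raise ValueError("start_work must be >= 1 and growth >= 2")
    best = 0
    work = start_work
    while work <= max_work:
        if time_post_work_wait(work) > post_wait_time:
            break  # the added work no longer hides behind the broadcast
        best = work
        work *= growth
    return best


if __name__ == "__main__":
    # Toy timing model, for illustration only: the broadcast takes 100
    # time units and the work overlaps with it perfectly.
    def perfect_overlap(work):
        return max(100, work)

    print(max_overlappable_work(100, perfect_overlap))
```

In a real benchmark the callable would time a nonblocking broadcast, the work loop, and the final wait, and would typically average over many iterations to smooth out timer jitter.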
Nonblocking Broadcast – Overlap - Poll
Nonblocking Broadcast – Overlap - Wait
All-To-All Data
All-To-All: 1 Byte
All-To-All: 64 Bytes
All-To-All: 128 Bytes
All-To-All: 4 MB/process
Allgather Data
All-Gather: 1 Byte
All-Gather: 128 Bytes
All-Gather: 131072 Bytes
Summary
Added hardware support for offloading broadcast operations
Developed MPI-level support for one-copy, asynchronous transfer of large contiguous data
Good collective performance
Good overlap capabilities