Next Generation of Co-Processors Emerges: In-Network Computing

The latest revolution in HPC is the move to a co-design architecture, a collaborative effort among industry, academia, and manufacturers to reach exascale performance by taking a holistic, system-level approach to fundamental performance improvements. Co-design improves system efficiency and optimizes performance by creating synergies between the hardware and the software.

Co-design recognizes that the CPU has reached the limits of its scalability and offers the intelligent network as a new “co-processor” that shares the responsibility for handling and accelerating application workloads. By placing data-related algorithms on an intelligent network, we can dramatically improve data center and application performance.
Smart Interconnects: The Next Key Driver of HPC Performance Gains

Bob Sorensen, VP Research, HPC Group, IDC:
“For greater HPC performance, doing more of the same will not cut it.”
“Smart interconnect promises a way forward.”
The Ever Growing Demand for Higher Performance
[Figure: “Performance Development” timeline, 2000–2020: terascale (SMP to clusters), petascale (“Roadrunner,” the first petascale system; single-core to many-core), and the path to exascale through co-design of hardware, software, and application.]
The Interconnect is the Enabling Technology
Exponential Data Growth – The Need for Intelligent and Faster Interconnect
CPU-centric (onload): the CPU must wait for the data, which creates performance bottlenecks. Data-centric (offload): analyze data as it moves through the network.

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
Breaking the Application Latency Wall
Today: Network device latencies are on the order of 100 nanoseconds
Challenge: Enabling the next order of magnitude improvement in application performance
Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
Where the time goes (approximate latencies):

| Era | Communication Framework | Network |
|---|---|---|
| 10 years ago | ~100 µs | ~10 µs |
| Today | ~10 µs | ~0.1 µs |
| Future (co-design) | ~1 µs | ~0.05 µs |
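To make these numbers concrete, one-way latency is typically measured with a ping-pong microbenchmark between two ranks, which captures both the network and the communication-framework (e.g., MPI) overhead. A minimal sketch, with an illustrative message size and iteration count:

```c
/* Minimal MPI ping-pong latency microbenchmark (illustrative sketch).
 * Run with two ranks: half the average round-trip time approximates
 * the one-way latency, including both network and MPI overhead. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;  /* illustrative iteration count */
    char buf[8] = {0};        /* small 8-byte payload */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.3f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```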
In-Network Computing and Acceleration Engines
• RDMA: the most efficient data access and data movement for compute and storage platforms; 200G with <1% CPU utilization.
• Collectives: CORE-Direct and SHArP technologies execute and manage collective communications; accelerate MPI, PGAS/SHMEM, UPC, Map/Reduce and more; improve communication performance.
• Tag Matching: MPI tag-matching offload and MPI Rendezvous protocol offload; accelerates MPI application performance.
• Network Transport: all communications managed and operated by the network hardware, with adaptive routing and congestion management; maximizes CPU availability for applications and increases network efficiency.
• Storage: NVMe over Fabrics offloads, plus T10-DIF and erasure-coding offloads; efficient end-to-end data protection, background checkpointing (burst buffer) and more; increases system performance and CPU availability.
• Security: data encryption/decryption (IEEE XTS standard) and key management, compliant with the Federal Information Processing Standards (FIPS); enhances data security options and enables protection between users sharing the same resources (different keys).
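For context on the tag-matching engine: MPI delivers each incoming message to the posted receive whose (communicator, source, tag) triple matches, and this matching is what the hardware offload performs instead of the CPU. A minimal sketch of the pattern (buffer names, tags, and sizes are illustrative):

```c
/* Sketch of MPI tag matching: receives posted with different tags are
 * matched against incoming sends by (communicator, source, tag). With
 * tag-matching offload, the matching happens in the network hardware. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double pressure[1024], velocity[1024];
    MPI_Request reqs[2];

    if (rank == 0) {
        /* Post both receives up front; the tags steer each incoming
         * message into the right buffer regardless of arrival order. */
        MPI_Irecv(pressure, 1024, MPI_DOUBLE, 1, /*tag=*/100,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(velocity, 1024, MPI_DOUBLE, 1, /*tag=*/200,
                  MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        for (int i = 0; i < 1024; i++) { pressure[i] = i; velocity[i] = 2.0 * i; }
        /* Send in the opposite order: matching is by tag, not order. */
        MPI_Send(velocity, 1024, MPI_DOUBLE, 0, 200, MPI_COMM_WORLD);
        MPI_Send(pressure, 1024, MPI_DOUBLE, 0, 100, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```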
Scalable Hierarchical Aggregation Protocol (SHArP)
Reliable Scalable General Purpose Primitive
• In-network, tree-based aggregation mechanism
• Large number of groups
• Multiple simultaneous outstanding operations
Applicable to Multiple Use-cases
• HPC Applications using MPI / SHMEM
• Distributed Machine Learning applications
Scalable High Performance Collective Offload
• Barrier, Reduce, All-Reduce, Broadcast
• Sum, Min, Max, MinLoc, MaxLoc, OR, XOR, AND
• Integer and floating-point data, 32- and 64-bit
[Figure: a SHArP tree: a root, aggregation nodes, and endnodes, each a process running on an HCA.]
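The operations SHArP offloads are standard MPI collectives, so applications need no code changes; the reduction is simply computed in the switch tree instead of on the host CPUs. A minimal sketch of one of the operation/type combinations listed above (SUM over 64-bit floating point):

```c
/* The collectives SHArP accelerates are ordinary MPI calls; with
 * in-network aggregation the reduction happens in the switch tree,
 * but the application code below is unchanged either way. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;  /* each rank contributes one value */
    double global_sum = 0.0;

    /* All-Reduce with SUM over 64-bit floats. */
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```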
SHArP Performance Advantage
MiniFE is a finite-element mini-application
• Implements kernels that represent implicit finite-element applications
[Chart: Allreduce MPI collective performance]
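The reason MiniFE stresses MPI_Allreduce is that implicit finite-element codes solve their linear systems with Krylov methods such as conjugate gradients, where every iteration requires global dot products. A sketch of that pattern (simplified; not MiniFE's actual source):

```c
/* Distributed dot product, the building block of CG-style solvers like
 * the kernels MiniFE represents: each rank reduces its local slice,
 * then one latency-bound MPI_Allreduce produces the global value. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N_LOCAL = 1000 };              /* illustrative local size */
    double x[N_LOCAL], y[N_LOCAL];
    for (int i = 0; i < N_LOCAL; i++) { x[i] = 1.0; y[i] = 2.0; }

    double local = 0.0, global = 0.0;
    for (int i = 0; i < N_LOCAL; i++)
        local += x[i] * y[i];

    /* At scale, this small allreduce each iteration dominates solver
     * time, which is why offloading it to the network pays off. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("dot = %f\n", global);
    MPI_Finalize();
    return 0;
}
```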
HPC-X with SHArP Technology
OpenFOAM is a popular computational fluid dynamics application
ORNL is managed by UT-Battelle for the US Department of Energy

From Titan to Summit: Accelerating Everything
Scott Atchley
System Architecture Team Lead
Technology Integration Group
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
Outline
• Mission of the Oak Ridge Leadership Computing Facility
• Titan – Accelerating Computing
• Summit – Accelerating Everything
US Department of Energy’s Office of Science Computation User Facilities
• DOE is a leader in open high-performance computing
• Provide the world’s most powerful computational tools for open science
• Access is free to researchers who publish
• Boost US competitiveness
• Attract the best and brightest researchers
• NERSC: Edison, 2.57 PF
• OLCF: Titan, 27 PF
• ALCF: Mira, 10 PF
Oak Ridge Leadership Computing Facility (OLCF)
Mission: deploy and operate the computational resources required to address global challenges
• Providing world-leading computational and data resources and specialized services for the most computationally intensive problems
• Providing a stable hardware/software path of increasing scale to maximize productive application development
• Providing the resources to investigate otherwise inaccessible systems at every scale: from galaxy formation to supernovae to earth systems to jet engines to automobiles to nanomaterials
• With our partners, delivering transforming discoveries in materials, biology, climate, energy technologies, and basic science
System roadmap:
• 2010 – Jaguar: 2.3 PF, multi-core CPU, 7 MW
• 2013 – Titan: 27 PF, hybrid GPU/CPU, 9 MW
• 2017 – Summit (CORAL system): 5-10x Titan, hybrid GPU/CPU, 10 MW
• 2022 – OLCF-5 (exascale system): 5-10x Summit, ~20 MW
Path to Accelerated Computing
• In 2009, Jaguar’s peak performance was 2.3 PF and it used 7 MW of power
• Users needed 10x performance in the next-generation system to meet their science goals
• CPUs alone could deliver about 2x; reaching 10x with CPUs would have taken too much power
GPUs provided a path forward using Hierarchical Parallelism
• Expose more parallelism through code refactoring and source code directives
– Doubles CPU performance of many codes
• Use right type of processor for each task
• Data locality: Keep data near processing
– GPU has high bandwidth to local memory for rapid access
– GPU has large internal cache
• Explicit data management:
– Explicitly manage data movement between CPU and GPU memories
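The directive-based approach above is typically expressed with OpenACC on Titan-class systems; a minimal sketch (illustrative array names and sizes, not from any particular OLCF code):

```c
/* Illustrative OpenACC sketch of hierarchical parallelism: the data
 * region handles explicit CPU<->GPU data movement, and the parallel
 * loop exposes the fine-grained parallelism the GPU needs. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];          /* static: too large for the stack */
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Explicit data management: copyin(x) moves x to the GPU only;
     * copy(y) moves y in and copies the result back out. */
    #pragma acc data copyin(x) copy(y)
    {
        /* The compiler maps this loop onto the GPU's many threads. */
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);      /* expect 4.0 */
    return 0;
}
```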
CPU vs. GPU accelerator:
• CPU: optimized for latency and sequential multitasking
• GPU: optimized for throughput and many simultaneous tasks; 10x the performance per socket and 5x more energy-efficient systems
ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors
Footprint: 4,352 ft² (404 m²)
SYSTEM SPECIFICATIONS:
• Peak performance of 27.1 PF (24.5 GPU + 2.6 CPU)
• 18,688 Compute Nodes each with:
• 16-Core AMD Opteron CPU (32 GB)
• NVIDIA Tesla “K20x” GPU (6 GB)
• 200 Cabinets
• 710 TB total system memory
• Cray Gemini 3D Torus Interconnect
Jaguar → Titan: 11.8x faster at 1.3x the power
Next-gen systems: Cori, Aurora, Summit (the many-core, hybrid generation)
| Attribute | 2016 Cori (NERSC) | 2018 Aurora (Argonne) | 2018 Summit (Oak Ridge) |
|---|---|---|---|
| Peak PF | >30 | 180 | >180 |
| Power (MW) | <3.7 | 13 | 13 |
| Processors | Intel Xeon Phi (KNL) and Haswell | Intel Xeon Phi (KNH) | IBM POWER9 + NVIDIA Volta |
| System memory | ~1 PB DDR4 + HBM + 1.5 PB persistent | >7 PB HBM + local + persistent | >6 PB HBM + local + persistent |
| Nodes | 9,300 compute + 1,900 data | >50,000 | ~4,600 |
| File system | 28 PB @ 744 GB/s, Lustre | 150 PB @ 1 TB/s, Lustre | 250 PB @ 2.5 TB/s, GPFS |
2017 OLCF Leadership System: Hybrid CPU/GPU architecture
• Vendor: IBM (Prime) / NVIDIA™ / Mellanox Technologies®
• 5-10X Titan’s Application Performance
• Approximately 4,600 nodes, each with:
– Two IBM POWER9 CPUs and six NVIDIA Tesla® GPUs using the NVIDIA Volta architecture
– CPUs and GPUs connected with high-speed NVLink 2.0
– Large coherent memory: over 512 GB (HBM + DDR4)
• All directly addressable from the CPUs and GPUs
– An additional 800 GB of NVRAM, which can be configured as either a burst buffer or as extended memory
– Over 40 TF peak performance
• Dual-rail Mellanox® EDR or HDR full, non-blocking fat-tree interconnect
• IBM Elastic Storage (GPFS™) – 2.5 TB/s I/O and 250 PB usable capacity.
Titan → Summit: 5-10x faster at 1.4x the power
How does Summit compare to Titan?
| Feature | Summit | Titan |
|---|---|---|
| Application performance | 5-10x Titan | Baseline |
| Number of nodes | ~4,600 | 18,688 |
| Node performance | >40 TF (~30x) | 1.4 TF |
| Memory per node | >512 GB (HBM + DDR4) (~14x) | 38 GB (GDDR5 + DDR3) |
| NVRAM per node | 800 GB (2.8x I/O vs. PFS) | 0 |
| Node interconnect | NVLink 2 (5-12x PCIe 3, 10-24x PCIe 2) | PCIe 2 |
| System interconnect (node injection bandwidth) | Dual-rail EDR, 25 GB/s (3.9x), or dual-rail HDR, 50 GB/s (7.8x) | Gemini, 6.4 GB/s |
| Interconnect topology | Non-blocking fat tree | 3D torus |
| Bisection bandwidth | EDR: 78.2 TB/s (~14x) | 5.6 TB/s |
| Max hops / latency 1/2 RTT | 5 hops / ~3 µs (~15x) | 33 hops / 46 µs |
| Barrier latency | ~4 µs (23x, with network-accelerated collectives) | 92 µs |
| Peak power consumption | 13 MW | 9 MW |
Acknowledgements
• Buddy Bland, OLCF Project Director
• Dr. Jack Wells, OLCF Director of Science
• OLCF Users & Staff
• ORNL CSMD Partners: Oscar Hernandez, M. Graham Lopez
• Titan Vendor Partners: Cray, AMD, NVIDIA
• Summit Vendor Partners: IBM, NVIDIA, Mellanox
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National
Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No.
DE-AC05-00OR22725.