Next Generation of Co-Processors Emerges: In-Network Computing
Source: sc16.supercomputing.org/sc-archive/bof/bof_files/bof116s2.pdf

Page 1

© 2016 Mellanox Technologies 1

The latest revolution in HPC is the move to a co-design architecture: a collaborative effort among industry, academia, and manufacturers to reach exascale performance by taking a holistic, system-level approach to fundamental performance improvements. Co-design improves system efficiency and optimizes performance by creating synergies between the hardware and the software.

Co-design recognizes that the CPU has reached the limits of its scalability, and offers an intelligent network as the new “co-processor” to share the responsibility for handling and accelerating application workloads. By placing data-related algorithms on an intelligent network, we can dramatically improve data-center and application performance.

Next Generation of Co-Processors Emerges: In-Network Computing

Page 2

Smart Interconnects: The Next Key Driver of HPC Performance Gains

“For Greater HPC Performance Doing More of the Same Will Not Cut It”

Bob Sorensen

VP Research, HPC Group @ IDC

“Smart Interconnect Promises a Way Forward”

Page 3

© 2016 Mellanox Technologies 3

The Ever Growing Demand for Higher Performance

[Timeline figure, 2000-2020: performance development from terascale through petascale ("Roadrunner", the first petascale system) toward exascale; architectural shifts from SMP to clusters and from single-core to many-core; co-design ties together hardware (HW), software (SW), and application (APP).]

The Interconnect is the Enabling Technology

Page 4

© 2016 Mellanox Technologies 4

Exponential Data Growth – The Need for Intelligent and Faster Interconnect

CPU-Centric (Onload): must wait for the data, creating performance bottlenecks
Data-Centric (Offload): analyze data as it moves!

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

Page 5

© 2016 Mellanox Technologies 5

Breaking the Application Latency Wall

Today: Network device latencies are on the order of 100 nanoseconds

Challenge: Enabling the next order of magnitude improvement in application performance

Solution: Creating synergies between software and hardware – intelligent interconnect

Intelligent Interconnect Paves the Road to Exascale Performance

Latency breakdown (network vs. communication framework):
10 years ago: network ~10 microseconds, communication framework ~100 microseconds
Today: network ~0.1 microsecond, communication framework ~10 microseconds
Future: network ~0.05 microsecond, co-design communication framework ~1 microsecond
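To make the numbers above concrete, the following is a minimal, hedged sketch of the classic MPI ping-pong micro-benchmark commonly used to measure end-to-end communication latency between two ranks. It is not taken from the slides; the 1-byte message size and the iteration count are arbitrary illustrative choices.

```c
/* Minimal MPI ping-pong latency sketch (illustrative only).
 * Rank 0 and rank 1 exchange a small message many times;
 * half the average round-trip time approximates one-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char byte = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.3f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```

Run with two ranks (for example, `mpirun -np 2 ./pingpong`), placing the ranks on different nodes so the measurement reflects network rather than shared-memory latency.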

Page 6

© 2016 Mellanox Technologies 6

In-Network Computing and Acceleration Engines

RDMA
• Most Efficient Data Access and Data Movement for Compute and Storage Platforms
• 200G with <1% CPU Utilization

Collectives
• CORE-Direct and SHArP Technologies
• Executes and Manages Collective Communications
• Accelerates MPI, PGAS/SHMEM and UPC, Map/Reduce and More
• Improves Communication Performance

Tag Matching
• MPI Tag-Matching Offload
• MPI Rendezvous Protocol Offload
• Accelerates MPI Application Performance

Network Transport
• All Communications Managed and Operated by the Network Hardware; Adaptive Routing and Congestion Management
• Maximizes CPU Availability for Applications, Increases Network Efficiency

Storage
• NVMe over Fabrics Offloads, T10-DIF and Erasure Coding Offloads
• Efficient End-to-End Data Protection, Background Checkpointing (Burst Buffer) and More
• Increases System Performance and CPU Availability

Security
• Data Encryption / Decryption (IEEE XTS Standard) and Key Management; Federal Information Processing Standards (FIPS) Compliant
• Enhances Data Security Options, Enables Protection Between Users Sharing the Same Resources (Different Keys)
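As a rough illustration of why offloading collectives to the network increases CPU availability, here is a hedged sketch (not from the slides) of a non-blocking MPI_Iallreduce overlapped with local computation; with engines such as CORE-Direct or SHArP, the reduction can progress in the fabric while the host keeps computing.

```c
/* Sketch: overlap a non-blocking all-reduce with local work.
 * With in-network collective offload the reduction progresses in the
 * fabric, leaving the CPU free for local_work(). Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

static void local_work(double *w, int n)
{
    for (int i = 0; i < n; i++)   /* stand-in for application compute */
        w[i] = w[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    double contrib[N], reduced[N], scratch[N];
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++) {
        contrib[i] = 1.0;
        scratch[i] = 2.0;
    }

    /* Start the all-reduce; contrib must not be modified until completion. */
    MPI_Iallreduce(contrib, reduced, N, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    local_work(scratch, N);       /* useful work while the collective runs */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("reduced[0] = %.1f\n", reduced[0]);

    MPI_Finalize();
    return 0;
}
```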

Page 7

© 2016 Mellanox Technologies 7

Scalable Hierarchical Aggregation Protocol (SHArP)

Reliable Scalable General Purpose Primitive

• In-network tree-based aggregation mechanism

• Large number of groups

• Multiple simultaneous outstanding operations

Applicable to Multiple Use-cases

• HPC Applications using MPI / SHMEM

• Distributed Machine Learning applications

Scalable High Performance Collective Offload

• Barrier, Reduce, All-Reduce, Broadcast

• Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND

• Integer and Floating-Point, 32 / 64 bit

[SHArP tree figure: SHArP Tree Root; SHArP Tree Aggregation Node (process running on HCA); SHArP Tree Endnode (process running on HCA).]
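At the application level these offloaded collectives are reached through the standard MPI interface. Below is a hedged sketch (not from the slides) that exercises a few of the operations and data types listed above: a 64-bit floating-point sum, a min-loc reduction, and a barrier. Nothing in the code is SHArP-specific; whether the operations actually run on the aggregation tree depends on the MPI library and fabric configuration.

```c
/* Sketch: MPI collectives of the kind an in-network aggregation tree
 * can execute on behalf of the hosts. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* 64-bit floating-point sum across all ranks. */
    double local = (double)rank, sum;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Min-loc: smallest value together with the rank that owns it. */
    struct { double val; int loc; } in = { (double)(rank + 1), rank }, out;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

    /* Barrier: pure synchronization, also listed as offloadable. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d sum=%.1f min=%.1f at rank %d\n",
               nranks, sum, out.val, out.loc);

    MPI_Finalize();
    return 0;
}
```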

Page 8

© 2016 Mellanox Technologies 8

SHArP Performance Advantage

MiniFE is a finite-element mini-application
• Implements kernels that represent implicit finite-element applications

[Chart: Allreduce MPI collective performance]

Page 9

© 2016 Mellanox Technologies 9

HPC-X with SHArP Technology

OpenFOAM is a popular computational fluid dynamics application

Page 10

ORNL is managed by UT-Battelle for the US Department of Energy

From Titan to Summit: Accelerating Everything

Scott Atchley

System Architecture Team Lead

Technology Integration Group

Oak Ridge Leadership Computing Facility

Oak Ridge National Laboratory

Page 11

11 From Titan to Summit: Accelerating Everything

Outline

• Mission of the Oak Ridge Leadership Computing Facility

• Titan – Accelerating Computing

• Summit – Accelerating Everything

Page 12

12 From Titan to Summit: Accelerating Everything

US Department of Energy’s Office of Science Computation User Facilities

• DOE is leader in open High-Performance Computing

• Provide the world’s most powerful computational tools for open science

• Access is free to researchers who publish

• Boost US competitiveness

• Attract the best and brightest researchers

NERSC: Edison is 2.57 PF
OLCF: Titan is 27 PF
ALCF: Mira is 10 PF

Page 13

13 From Titan to Summit: Accelerating Everything

Oak Ridge Leadership Computing Facility (OLCF)
Mission: Deploy and operate the computational resources required to address global challenges

Providing world-leading computational and data resources and specialized services for the most computationally intensive problems

Providing stable hardware/software path of increasing scale to maximize productive applications development

Providing the resources to investigate otherwise inaccessible systems at every scale: from galaxy formation to supernovae to earth systems to jet engines to automobiles to nanomaterials

With our partners, deliver transforming discoveries in materials, biology, climate, energy technologies, and basic science

System roadmap:
2010: Jaguar, 2.3 PF, multi-core CPU, 7 MW
2013: Titan, 27 PF, hybrid GPU/CPU, 9 MW
2017: Summit (CORAL system), 5-10x Titan, hybrid GPU/CPU, 10 MW
2022: OLCF-5 (exascale system), 5-10x Summit, ~20 MW

Page 14

14 From Titan to Summit: Accelerating Everything

Path to Accelerated Computing

• In 2009, Jaguar’s peak performance was 2.3 PF and it used 7MW of power

• Users needed 10x performance in the next generation system to meet their science goals

• We could get 2x with CPUs – Too much power to get to 10x

Page 15

15 From Titan to Summit: Accelerating Everything

GPUs provided a path forward using Hierarchical Parallelism

• Expose more parallelism through code refactoring and source code directives

– Doubles CPU performance of many codes

• Use right type of processor for each task

• Data locality: Keep data near processing

– GPU has high bandwidth to local memory for rapid access

– GPU has large internal cache

• Explicit data management:

– Explicitly manage data movement between CPU and GPU memories
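As an illustration of the directive-based refactoring and explicit data management described in the bullets above, here is a hedged OpenACC sketch in C (not from the slides; the kernel and array sizes are placeholders):

```c
/* Sketch: expose loop parallelism with a directive and manage CPU<->GPU
 * data movement explicitly. Illustrative only; requires an OpenACC-capable
 * compiler (e.g. the NVIDIA/PGI compilers with -acc). */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) {   /* initialize on the host (CPU) */
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* Explicit data region: x is copied to GPU memory, y is copied in
       and back out; inside the region the data stays resident on the GPU. */
    #pragma acc data copyin(x[0:N]) copy(y[0:N])
    {
        /* The directive exposes the loop's parallelism to the accelerator. */
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];  /* simple SAXPY-style kernel */
    }

    printf("y[0] = %f (expected 4.0)\n", y[0]);
    return 0;
}
```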

CPU: optimized for latency and sequential multitasking
GPU accelerator: optimized for throughput and many simultaneous tasks
• 10x performance per socket
• 5x more energy-efficient systems

Page 16

16 From Titan to Summit: Accelerating Everything

ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

System footprint: 4,352 ft² (404 m²)

SYSTEM SPECIFICATIONS:

• Peak performance of 27.1 PF (24.5 + 2.6)

• 18,688 Compute Nodes each with:

• 16-Core AMD Opteron CPU (32 GB)

• NVIDIA Tesla “K20x” GPU (6 GB)

• 200 Cabinets

• 710 TB total system memory

• Cray Gemini 3D Torus Interconnect

Jaguar→Titan: 11.8x faster, 1.3x power

Page 17

17 From Titan to Summit: Accelerating Everything

Next-gen systems: Cori, Aurora, Summit
The many-core, hybrid generation

Attributes | 2016 Cori (NERSC) | 2018 Aurora (Argonne) | 2018 Summit (Oak Ridge)
Peak PF | >30 | 180 | >180
Power (MW) | <3.7 | 13 | 13
Processors | Intel Xeon Phi (KNL) and Haswell | Intel Xeon Phi (KNH) | IBM POWER9 + NVIDIA Volta
System memory | ~1 PB DDR4 + HBM + 1.5 PB persistent | >7 PB HBM + local + persistent | >6 PB HBM + local + persistent
Nodes | 9,300 compute + 1,900 data | >50,000 | ~4,600
File system | 28 PB @ 744 GB/s Lustre | 150 PB @ 1 TB/s Lustre | 250 PB @ 2.5 TB/s GPFS

Page 18

18 From Titan to Summit: Accelerating Everything

2017 OLCF Leadership System: Hybrid CPU/GPU architecture

• Vendor: IBM (Prime) / NVIDIA™ / Mellanox Technologies®

• 5-10X Titan’s Application Performance

• Approximately 4,600 nodes, each with:

– Two IBM POWER9 CPUs and six NVIDIA Tesla® GPUs using the NVIDIA Volta architecture

– CPUs and GPUs connected with high speed NVLink 2.0

– Large coherent memory: over 512 GB (HBM + DDR4)

• All directly addressable from the CPUs and GPUs

– An additional 800 GB of NVRAM, which can be configured as either a burst buffer or as extended memory

– Over 40 TF peak performance

• Dual-rail Mellanox® EDR or HDR full, non-blocking fat-tree interconnect

• IBM Elastic Storage (GPFS™) – 2.5 TB/s I/O and 250 PB usable capacity.

Titan→Summit: 5-10x faster, 1.4x power

Page 19

19 From Titan to Summit: Accelerating Everything

How does Summit compare to Titan?

Feature | Summit | Titan
Application performance | 5-10x Titan | Baseline
Number of nodes | ~4,600 | 18,688
Node performance | >40 TF (~30x) | 1.4 TF
Memory per node | >512 GB (HBM + DDR4) (~14x) | 38 GB (GDDR5 + DDR3)
NVRAM per node | 800 GB (2.8x I/O vs. PFS) | 0
Node interconnect | NVLink 2 (5-12x PCIe 3, 10-24x PCIe 2) | PCIe 2
System interconnect (node injection bandwidth) | Dual-rail EDR 25 GB/s (3.9x) or dual-rail HDR 50 GB/s (7.8x) | Gemini 6.4 GB/s
Interconnect topology | Non-blocking fat tree | 3D torus
Bisection bandwidth | EDR 78.2 TB/s (~14x) | 5.6 TB/s
Max hops / latency 1/2 RTT (µs) | 5 hops / ~3 µs (~15x) | 33 hops / 46 µs
Barrier latency (µs) | ~4 µs (23x) | 92 µs
Peak power consumption | 13 MW | 9 MW

Network Accelerated Collectives

Page 20

20 From Titan to Summit: Accelerating Everything

Acknowledgements

• Buddy Bland, OLCF Project Director

• Dr. Jack Wells, OLCF Director of Science

• OLCF Users & Staff

• ORNL CSMD Partners: Oscar Hernandez, M. Graham Lopez

• Titan Vendor Partners: Cray, AMD, NVIDIA

• Summit Vendor Partners: IBM, NVIDIA, Mellanox

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Page 21

© 2016 Mellanox Technologies 21

The latest revolution in HPC is the move to a co-design architecture: a collaborative effort among industry, academia, and manufacturers to reach exascale performance by taking a holistic, system-level approach to fundamental performance improvements. Co-design improves system efficiency and optimizes performance by creating synergies between the hardware and the software.

Co-design recognizes that the CPU has reached the limits of its scalability, and offers an intelligent network as the new “co-processor” to share the responsibility for handling and accelerating application workloads. By placing data-related algorithms on an intelligent network, we can dramatically improve data-center and application performance.

Next Generation of Co-Processors Emerges: In-Network Computing