
Page 1:

© 2019 Mellanox Technologies | Confidential

Qingchun Song, Mellanox, July 2, 2019

Mellanox In-Network Computing For AI and The Development With NVIDIA (SHARP – NCCL)

Page 2:

Data Processing Revolution – Data-Centric

Compute-Centric (Von Neumann Machine) -> Data-Centric (Dataflow Machine)

Page 3:

CPU-Centric HPC/AI Center

The CPU handles everything: compute, network, and storage functions.

Page 4:

Data-Centric HPC/AI Center

Workloads run over a communication framework (MPI), with functions distributed across the system:
▪ In-CPU Computing – CPU functions
▪ In-Network Computing – network functions
▪ In-Storage Computing – storage functions

Page 5:

In-Network Computing Architecture

▪ CPU-Centric (Onload): communication latencies of 30-40us
▪ Data-Centric (Offload): communication latencies of 3-4us

Page 6:

In-Network Computing to Enable Data-Centric Data Centers

▪ GPUDirect RDMA
▪ Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
▪ NVMe over Fabrics

Page 7:

In-Network Computing Connects the World’s Fastest HPC and AI Supercomputer

• Summit CORAL System, World’s Fastest HPC / AI System

• NVIDIA V100 GPU + InfiniBand HCA + In-Network Computing Fabric

Page 8:

GPUDirect RDMA Technology and Advantages

Page 9:

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Page 10:

Accelerating HPC and AI Applications

Accelerating HPC Applications
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Enabling Artificial Intelligence Solutions to Perform Critical and Timely Decision Making
▪ Accelerating distributed machine learning

Page 11:

AllReduce Example – Trees

▪ Many2One and One2Many traffic patterns – possible network congestion
▪ Probably not a good solution for large data
▪ Large scale requires a higher tree / larger radix
▪ Result distribution – over the tree / MC

(Diagram: a two-stage reduction tree – end-nodes feed Stage 1 switches, which feed a Stage 2 switch)
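The staged tree reduction above can be modeled in a few lines of Python. This is a toy, single-process sketch (the function name and `radix` parameter are illustrative, not part of any Mellanox API): each tree stage sums the partial results of up to `radix` children, and the root's total is distributed back to every end-node.

```python
# Toy model of a tree-based allreduce: leaf end-nodes send values up
# through stages of switches; each switch aggregates up to `radix`
# children; the root multicasts the final result back down.

def tree_allreduce(values, radix=4):
    """Reduce `values` up a tree of the given radix, then broadcast."""
    level = list(values)
    while len(level) > 1:
        # One tree stage: each switch sums the partials of its children.
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
    total = level[0]
    # Result distribution: every end-node receives the same total.
    return [total] * len(values)

print(tree_allreduce(range(8)))  # [28, 28, 28, 28, 28, 28, 28, 28]
```

A larger radix flattens the tree (fewer stages, lower latency) at the cost of more concurrent many-to-one traffic into each switch, which is the congestion concern the slide raises.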

Page 12:

AllReduce (Example) - Recursive Doubling

▪ The data is recursively divided, processed by the CPUs, and distributed
▪ The ranks' CPUs are occupied performing the reduce algorithm
▪ The data is sent at least twice, consuming at least twice the bandwidth

(Diagram: four ranks exchange ½- and ¼-data blocks over four steps – a calculation phase followed by a result-sending phase)
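The host-driven exchange above can be sketched as a toy simulation, assuming a power-of-two number of ranks (names are illustrative, and this simplified version exchanges whole values rather than recursively halving the payload). Note that every partial sum passes through a rank's CPU at each step, which is exactly the work SHARP moves into the network:

```python
# Toy single-process simulation of recursive-doubling allreduce:
# at step k, rank r exchanges its partial sum with rank r XOR 2^k
# and both add the received value; after log2(p) steps every rank
# holds the global sum.

def recursive_doubling_allreduce(ranks):
    """All ranks end with the global sum after log2(p) pairwise exchanges."""
    p = len(ranks)
    assert p & (p - 1) == 0, "power-of-two rank count assumed"
    vals = list(ranks)
    step = 1
    while step < p:
        # Snapshot semantics: all exchanges in a step happen "at once".
        vals = [vals[r] + vals[r ^ step] for r in range(p)]
        step <<= 1
    return vals

print(recursive_doubling_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```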

Page 13:

Scalable Hierarchical Aggregation Protocol

Reliable, Scalable, General-Purpose Primitive, Applicable to Multiple Use-cases
▪ In-network tree-based aggregation mechanism
▪ Large number of groups
▪ Multiple simultaneous outstanding operations
▪ Streaming aggregation

Accelerating HPC applications
▪ Scalable high-performance collective offload
▪ Barrier, Reduce, All-Reduce, Broadcast
▪ Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
▪ Integer and floating-point, 16 / 32 / 64 bit
▪ Up to 1KB payload size (in Quantum)
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Accelerating Machine Learning applications
▪ Proven for the many-to-one traffic pattern
▪ CUDA, GPUDirect RDMA

(Diagram legend: SHArP Tree with a root, aggregation nodes, and end-nodes – each a process running on an HCA)

Page 14:

Scalable Hierarchical Aggregation Protocol

▪ SHARP Tree is a Logical Construct
▪ Nodes in the SHArP tree are IB end-nodes
▪ Logical tree defined on top of the physical underlying fabric
▪ SHArP tree links are implemented on top of the IB transport (Reliable Connection)
▪ Expected to follow the physical topology for performance, but not required

▪ SHARP Operations are Executed by a SHARP Tree
▪ Multiple SHArP trees are supported
▪ Each SHArP tree can handle multiple outstanding SHArP operations
▪ Within a SHArP tree, each operation is uniquely identified by a SHArP-tuple: GroupID and SequenceNumber

(Diagram: one SHArP tree – root, aggregation nodes, and end-nodes – mapped onto the physical topology of switches/routers and HCAs)
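The SHArP-tuple bookkeeping can be illustrated with a toy aggregation node in Python. The class and method names here are hypothetical, not a Mellanox API; the point is that keying pending partial results by (GroupID, SequenceNumber) lets a single tree carry several outstanding reductions at once:

```python
# Toy SHArP aggregation node: partial results of concurrent operations
# are tracked by the (group_id, sequence_number) tuple, so one tree can
# service multiple outstanding reductions simultaneously.

class AggregationNode:
    def __init__(self, n_children):
        self.n_children = n_children
        self.pending = {}  # (group_id, seq) -> (partial_sum, replies_seen)

    def on_child_request(self, group_id, seq, value):
        """Accumulate one child's contribution for a given SHArP-tuple.
        Returns the reduced value once all children have reported
        (ready to forward up the tree), else None."""
        key = (group_id, seq)
        partial, count = self.pending.get(key, (0, 0))
        partial, count = partial + value, count + 1
        if count == self.n_children:
            del self.pending[key]  # operation complete at this node
            return partial
        self.pending[key] = (partial, count)
        return None
```

Two operations from different groups can interleave freely: a node with two children that receives one contribution for group 1 and one for group 2 simply holds two pending entries until each tuple completes.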

Page 15:

SHARP Principles of Operation - Request

(Diagram: aggregation requests flow from the end-nodes up toward the SHArP tree root)

Page 16:

SHARP Principles of Operation – Response

(Diagram: the aggregation response is distributed from the SHArP tree root back down to the end-nodes)

Page 17:

NCCL Ring

▪ Simple
▪ Linear latency
▪ Supported in NCCL 2.3 and previous versions
▪ Multiple rings

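The ring algorithm's behavior can be sketched as a single-process simulation (illustrative names, sum reduction only). Each of the 2(p-1) steps moves one chunk to the right-hand neighbor, which is why latency grows linearly with the number of ranks:

```python
# Toy ring allreduce: p ranks, each holding p chunks. A reduce-scatter
# phase (p-1 steps) leaves each rank owning one fully-reduced chunk,
# then an all-gather phase (p-1 steps) circulates the reduced chunks.

def ring_allreduce(per_rank):
    """`per_rank[r][c]` is rank r's value for chunk c."""
    p = len(per_rank)
    data = [list(chunks) for chunks in per_rank]
    # Reduce-scatter: each rank forwards one chunk to its right neighbor,
    # which accumulates it. Snapshot `sent` so all sends happen "at once".
    for step in range(p - 1):
        sent = [data[r][(r - step) % p] for r in range(p)]
        for r in range(p):
            data[(r + 1) % p][(r - step) % p] += sent[r]
    # All-gather: the fully-reduced chunks circulate around the ring.
    for step in range(p - 1):
        sent = [data[r][(r + 1 - step) % p] for r in range(p)]
        for r in range(p):
            data[(r + 1) % p][(r + 1 - step) % p] = sent[r]
    return data

print(ring_allreduce([[1, 2], [3, 4]]))  # [[4, 6], [4, 6]]
```

The ring is bandwidth-optimal for large messages, but its 2(p-1)-step critical path is what motivates the tree (logarithmic) and SHARP (in-network) alternatives on the following slides.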

Page 18:

NCCL Tree

▪ Support added in NCCL 2.4
▪ Keeps the intra-node chain
▪ Node leaders participate in the tree
▪ Binary double tree
▪ Multiple rings -> multiple trees

(Diagram: NICs attached to the network fabric)

Page 19:

NCCL SHARP

(Diagram: NICs attached to the network fabric)

Page 20:

NCCL SHARP

▪ Collective network plugin
▪ Replaces the inter-node tree with a SHARP tree
▪ Keeps the intra-node ring
▪ Aggregation in the network switch
▪ Streaming from GPU memory with GPUDirect RDMA
▪ 2x bandwidth compared to NCCL-TREE
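Putting the pieces together, the flow described above – intra-node reduction, in-switch aggregation of the node leaders' partials, then broadcast back – can be sketched as a toy model (illustrative names; the real implementation goes through NCCL's collective-network plugin interface):

```python
# Toy model of the NCCL-SHARP scheme: ranks reduce within each node,
# node leaders stream their partials to the in-network aggregator
# (the switch), and the global result is broadcast back to all ranks.

def hierarchical_allreduce(nodes):
    """`nodes` is a list of per-node rank values, e.g. [[1, 2], [3, 4]]."""
    # Step 1: intra-node reduction – one partial sum per node leader.
    leader_partials = [sum(ranks) for ranks in nodes]
    # Step 2: the switch aggregates the leaders' partials (SHARP tree).
    total = sum(leader_partials)
    # Step 3: leaders distribute the result back over the intra-node ring.
    return [[total] * len(ranks) for ranks in nodes]

print(hierarchical_allreduce([[1, 2], [3, 4]]))  # [[10, 10], [10, 10]]
```

Only one partial sum per node crosses the fabric, and the inter-node reduction itself happens in the switch, which is where the bandwidth advantage over host-based trees comes from.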

SHARP Enables 2X Higher Data Throughput for NCCL

(Chart: NCCL Allreduce bandwidth, 1 ring – bandwidth in GB/s vs. message size from 8MB to 2048MB, comparing SHARP, Ring, and Tree)

Page 21:

SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables a 75% reduction in latency, providing scalable, flat latency.

Scalable Hierarchical Aggregation and Reduction Protocol

Page 22:

SHARP AllReduce Performance Advantages (1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology)

SHARP Enables the Highest Performance

Scalable Hierarchical Aggregation and Reduction Protocol

Page 23:

SHARP Performance – Application (OSU)

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

Source: Prof. DK Panda, Ohio State University

Page 24:

SHARP Performance Advantage for AI

▪ SHARP provides a 16% performance increase for deep learning (initial results)
▪ TensorFlow with Horovod running the ResNet50 benchmark over HDR InfiniBand (ConnectX-6, Quantum)

(Chart: performance gains of 16% and 11%)

P100 NVIDIA GPUs, RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0

Page 25:

NCCL-SHARP Performance – DL Training

(Chart: ResNet50 training throughput – NCCL: 18,755 images/sec; NCCL-SHARP: 20,456 images/sec)

System Configuration: (4) HPE Apollo 6500 systems configured with (8) NVIDIA Tesla V100 SXM2 16GB, (2) HPE DL360 Gen10 Intel Xeon-Gold 6134 (3.2 GHz/8-core/130 W) CPUs, (24) DDR4-2666 CAS-19-19-19 Registered Memory Modules, HPE 1.6 TB NVMe SFF (2.5") SSD, ConnectX-6 HCA, IB Quantum Switch (EDR speed), Ubuntu 16.04

Page 26:

Accelerating All Levels of HPC / AI Frameworks

Application
▪ Data Analysis
▪ Real Time
▪ Deep Learning

Communication
▪ Mellanox SHARP In-Network Computing
▪ MPI Tag Matching
▪ MPI Rendezvous
▪ Software Defined Virtual Devices

Network
▪ Network Transport Offload
▪ RDMA and GPUDirect RDMA
▪ SHIELD (Self-Healing Network)
▪ Enhanced Adaptive Routing and Congestion Control

Connectivity
▪ Multi-Host Technology
▪ Socket-Direct Technology
▪ Enhanced Topologies

Page 27:

Thank You