Highest Performance and Scalability for HPC and AI


Page 1: Highest Performance and Scalability for HPC and AI


Mellanox – The Intelligent Interconnect Company, February 2018

Highest Performance and Scalability for HPC and AI

Page 2: Highest Performance and Scalability for HPC and AI


NVMe and NVMe over Fabrics, Big Data Storage, Hadoop / Spark, Software Defined Storage, SQL & NoSQL Databases, Deep Learning, HPC, GPUDirect, RDMA, MPI, NCCL, SHARP, Image & Video Recognition, Sentiment Analysis, Fraud & Flaw Detection, Voice Recognition & Search

Same Interconnect Technology Enables a Variety of Applications

Page 3: Highest Performance and Scalability for HPC and AI


Accelerating the Next Generation of HPC and AI

Summit CORAL System

Fastest Supercomputer in Japan

Fastest Supercomputer in Canada

InfiniBand Dragonfly+ Topology

Sierra CORAL System

Highest Performance with InfiniBand

Page 4: Highest Performance and Scalability for HPC and AI


Higher Data Speeds, Faster Data Processing

Better Data Security

Adapters, Switches, Cables & Transceivers

SmartNIC, System on a Chip

HPC and AI Need the Most Intelligent Interconnect

Page 5: Highest Performance and Scalability for HPC and AI


Cloud, Big Data, Enterprise, Business Intelligence, HPC, Storage, Security, Machine Learning, Internet of Things

Exponential Data Growth Everywhere

Page 6: Highest Performance and Scalability for HPC and AI


Big Data? No… REALLY BIG DATA

The average data generated by a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this applies mostly to full-service fleet vehicles)

The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second; over an average 12-hour flight it can produce up to 844 TB of data

Mellanox is the de-facto interconnect for deep learning deployments

Page 7: Highest Performance and Scalability for HPC and AI


What’s Important For ML & Big Data Networking?

High Bandwidth and Low Latency: massive amounts of data require thick data pipes. The NVIDIA DGX-1 uses 4 x 100Gb/s InfiniBand; port latency is under 90 ns with Mellanox EDR InfiniBand, and Spectrum is the world's lowest-latency Ethernet switch

No Packet Loss! Mellanox Spectrum switches have ZERO packet loss

Offloads – free up the CPU: stateless offloads, RDMA, GPUDirect™ RDMA, GPUDirect ASYNC, virtualization and container networking

In-Network Computing: SHARP

Storage Acceleration: RDMA, NVMe over Fabrics acceleration, erasure coding acceleration

Security: IPsec and TLS offloads

Resiliency: robust end-to-end network solutions

Page 8: Highest Performance and Scalability for HPC and AI


Data Centric Architecture to Overcome Latency Bottlenecks

CPU-Centric (Onload): communication latencies of 30-40 µs

Data-Centric (Offload) with In-Network Computing: communication latencies of 3-4 µs

Intelligent Interconnect Paves the Road to Exascale Performance

Page 9: Highest Performance and Scalability for HPC and AI


In-Network Computing – Delivers Highest Application Performance – 10X Performance Acceleration

Self-Healing Technology – Unbreakable Data Centers – 35X Faster Network Recovery

GPUDirect™ RDMA – GPU Acceleration Technology, Critical for HPC and Machine Learning Applications – 10X Performance Acceleration

In-Network Computing Delivers Highest Performance

Page 10: Highest Performance and Scalability for HPC and AI


30%-250% Higher Return on Investment

Up to 50% Saving on Capital and Operation Expenses

Highest Application Performance, Scalability and Productivity

InfiniBand Delivers Best Return on Investment

Molecular Dynamics: 1.9X Better | Chemistry: 2X Better | Automotive: 1.4X Better | Genomics: 2.5X Better | Weather: 1.3X Better

Page 12: Highest Performance and Scalability for HPC and AI


10X Higher Performance with GPUDirect™ RDMA Technology

GPUDirect™ RDMA

Purpose-built for Acceleration of Deep Learning

Lowest communication latency for acceleration devices

No unnecessary system memory copies and CPU overhead

Page 13: Highest Performance and Scalability for HPC and AI


GPUDirect™ ASYNC

GPUDirect RDMA (GPUDirect 3.0) – direct data path between the GPU and the Mellanox interconnect; the control path still uses the CPU

The CPU prepares and queues communication tasks on the GPU; the GPU triggers communication on the HCA; the Mellanox HCA directly accesses GPU memory

GPUDirect ASYNC (GPUDirect 4.0) – both the data path and the control path go directly between the GPU and the Mellanox interconnect

Maximum Performance for GPU Clusters
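
The benefit is visible from application code once the MPI library is CUDA-aware: the GPU buffer is handed to MPI directly and the HCA reads it over GPUDirect RDMA, with no staging copy through host memory. Below is a minimal sketch; the use of mpi4py and CuPy, and a CUDA-aware MPI build, are assumptions for illustration and are not named in the deck.

```python
# Minimal sketch: passing a GPU buffer directly to MPI.
# Assumes a CUDA-aware MPI build, mpi4py >= 3.1 and CuPy (illustrative choices).
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.arange(1 << 20, dtype=cp.float32)   # data lives in GPU memory
cp.cuda.Device().synchronize()               # ensure the GPU has finished writing

if rank == 0:
    comm.Send(buf, dest=1, tag=0)            # HCA reads GPU memory directly
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)          # received straight into GPU memory
```

With GPUDirect ASYNC the control path (queueing and triggering of the transfers) also moves off the CPU; that happens inside the communication stack rather than in application code like the above.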

Page 14: Highest Performance and Scalability for HPC and AI


Distributed Training

Training on large data sets can take a long time – in some cases, weeks

In many cases training needs to happen frequently: model development and tuning, and real-life use cases may require regular retraining

Accelerate training with a scale-out architecture: add workers (nodes) to reduce training time

Data Parallelism is the solution

The network is a critical element in accelerating distributed training!
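
To make the data-parallel pattern concrete, here is a toy sketch in which every worker computes gradients on its own data shard and an allreduce averages them each step. mpi4py, NumPy and the placeholder model are assumptions for the example, not something the deck prescribes.

```python
# Toy data-parallel training step: each worker computes local gradients,
# an Allreduce averages them, and every replica applies the same update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

weights = np.zeros(1000, dtype=np.float64)   # identical initial model on every rank
lr = 0.01

def local_gradient(w):
    # Placeholder for backprop over this worker's shard of the data set.
    return np.random.default_rng(comm.Get_rank()).standard_normal(w.shape)

for step in range(100):
    grad = local_gradient(weights)
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)    # the network-critical step each iteration
    weights -= lr * (avg / size)             # identical update on all replicas
```

The allreduce runs every iteration over the full gradient, which is why interconnect bandwidth and latency bound how far the training can scale out.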

Page 15: Highest Performance and Scalability for HPC and AI


RDMA Accelerates Distributed Training

Data parallelism communication: workers to parameter servers and parameter servers to workers; frequent; potentially high bandwidth; bursty; point-to-point and collectives

RDMA improves point-to-point and collective operations

Page 16: Highest Performance and Scalability for HPC and AI


Model Parallelism

Model size is limited by the compute engine (a GPU, for example)

In some cases the model doesn't fit in the compute engine: large models, or small compute devices such as FPGAs

Model parallelism slices the model and runs each part on a different compute engine

Networking becomes a critical element

High bandwidth, low latency and RDMA are mandatory

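To make the slicing concrete, here is a minimal two-rank sketch in which each rank owns one layer and the activations cross the network between them. The layer sizes and the use of mpi4py/NumPy are assumptions made purely for illustration.

```python
# Toy model parallelism: rank 0 owns layer 1, rank 1 owns layer 2.
# Activations travel over the interconnect between the stages, which is
# why bandwidth, latency and RDMA matter for this pattern.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
batch, hidden = 64, 4096

if rank == 0:
    w1 = np.random.rand(512, hidden).astype(np.float32)
    x = np.random.rand(batch, 512).astype(np.float32)
    act = np.maximum(x @ w1, 0)              # layer 1 forward pass (ReLU)
    comm.Send(act, dest=1, tag=0)            # ship activations to the next stage
elif rank == 1:
    w2 = np.random.rand(hidden, 10).astype(np.float32)
    act = np.empty((batch, hidden), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)
    logits = act @ w2                        # layer 2 forward pass
```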

Page 17: Highest Performance and Scalability for HPC and AI


Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): a reliable, scalable, general-purpose primitive

In-network tree-based aggregation mechanism: large number of groups, multiple simultaneous outstanding operations

Applicable to multiple use cases: HPC applications using MPI / SHMEM, distributed machine learning applications

Scalable high-performance collective offload: Barrier, Reduce, Allreduce, Broadcast and more; Sum, Min, Max, MinLoc, MaxLoc, OR, XOR, AND; integer and floating-point, 16/32/64/128 bits

SHARP tree: aggregation nodes and end nodes (processes running on the HCA), rooted at a SHARP tree root
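
The collectives SHARP offloads are the standard MPI primitives, so application code does not change. The sketch below issues a few of them with mpi4py; whether they are actually aggregated in the switches depends on the MPI stack having SHARP enabled (for example via Mellanox HPC-X settings), which is assumed here and sits outside the code.

```python
# Standard MPI collectives of the kind listed above (Barrier, Reduce,
# Allreduce, Bcast). With SHARP enabled in the MPI stack, these same calls
# can be reduced in-network instead of on the hosts. mpi4py is illustrative.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

comm.Barrier()                                        # synchronization

local = np.full(4, rank, dtype=np.float64)
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)              # sum across all ranks

peak = np.empty_like(local)
comm.Reduce(local, peak, op=MPI.MAX, root=0)          # max, result on rank 0

params = np.arange(4, dtype=np.float64) if rank == 0 else np.empty(4)
comm.Bcast(params, root=0)                            # broadcast from rank 0
```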

Page 18: Highest Performance and Scalability for HPC and AI


SHARP Allreduce Performance Advantages

SHARP Enables 75% Reduction in Latency

Providing Scalable Flat Latency

Page 19: Highest Performance and Scalability for HPC and AI


Performs the gradient averaging; removes the need for a physical parameter server; removes all parameter server overhead

Mellanox SHARP Technology Accelerates AI

The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes)

Page 20: Highest Performance and Scalability for HPC and AI


Mellanox Accelerates TensorFlow with RDMA

Unmatched Linear Scalability at No Additional Cost

50% Better Performance
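
For reference, TensorFlow 1.x exposed its RDMA (verbs) transport through the distributed server's protocol option. The fragment below is a sketch assuming a TensorFlow build compiled with verbs support; the cluster host names and ports are hypothetical.

```python
# Sketch: selecting the RDMA (verbs) transport in TensorFlow 1.x distributed
# training. Assumes a TensorFlow build with verbs support; node names and
# ports are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],
    "worker": ["node1:2222", "node2:2222"],
})

# "grpc+verbs" keeps gRPC for administrative messages while tensor
# transfers move onto RDMA.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
```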

Page 21: Highest Performance and Scalability for HPC and AI


Mellanox Accelerates TensorFlow (Advanced Verbs)

10GbE is not enough for large-scale models – 6.5X faster training with 100GbE (speedups of 2.5X and 6.5X shown)

Page 22: Highest Performance and Scalability for HPC and AI


Mellanox Accelerates NVIDIA NCCL 2.0

50% performance improvement with NVIDIA® DGX-1 across 32 NVIDIA Tesla V100 GPUs, using InfiniBand RDMA and GPUDirect™ RDMA
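
NCCL is normally driven through a deep-learning framework. As one hedged illustration (PyTorch is not mentioned in the deck), the sketch below runs an allreduce over the NCCL backend, which in turn can use InfiniBand RDMA and GPUDirect RDMA between nodes.

```python
# Illustrative only: driving NCCL through PyTorch's distributed backend.
# One process per GPU; MASTER_ADDR/MASTER_PORT, RANK and WORLD_SIZE are
# expected in the environment (set by the launcher).
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

grad = torch.randn(1 << 20, device="cuda")   # e.g. a flattened gradient bucket
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # NCCL allreduce across all GPUs
grad /= dist.get_world_size()                # average for data-parallel training
```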

Page 23: Highest Performance and Scalability for HPC and AI


Questions?