Highest Performance and Scalability for HPC and AI


Page 1: Highest Performance and Scalability for HPC and AI


Mellanox – The Intelligent Interconnect Company, February 2018

Highest Performance and Scalability for HPC and AI

Page 2: Highest Performance and Scalability for HPC and AI


NVMe and NVMe over Fabrics, Big Data Storage, Hadoop / Spark, Software Defined Storage, SQL & NoSQL Databases, Deep Learning, HPC, GPUDirect, RDMA, MPI, NCCL, SHARP, Image & Video Recognition, Sentiment Analysis, Fraud & Flaw Detection, Voice Recognition & Search

Same Interconnect Technology Enables a Variety of Applications

Page 3: Highest Performance and Scalability for HPC and AI


Accelerating the Next Generation of HPC and AI

Summit CORAL System

Fastest Supercomputer in Japan

Fastest Supercomputer in Canada

InfiniBand Dragonfly+ Topology

Sierra CORAL System

Highest Performance with InfiniBand

Page 4: Highest Performance and Scalability for HPC and AI


Higher Data Speeds, Faster Data Processing

Better Data Security

Adapters, Switches, Cables & Transceivers

SmartNIC, System on a Chip

HPC and AI Need the Most Intelligent Interconnect

Page 5: Highest Performance and Scalability for HPC and AI


Cloud, Big Data, Enterprise, Business Intelligence, HPC, Storage, Security, Machine Learning, Internet of Things

Exponential Data Growth Everywhere

Page 6: Highest Performance and Scalability for HPC and AI


Big Data? No… REALLY BIG DATA

The average data generated by a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this applies mostly to full-service fleet vehicles)

The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second; over an average 12-hour flight it can produce up to 844 TB of data

Mellanox is the de-facto interconnect for deep learning deployments

Page 7: Highest Performance and Scalability for HPC and AI


What’s Important For ML & Big Data Networking?

High Bandwidth and Low Latency: massive amounts of data require thick data pipes. The NVIDIA DGX-1 uses 4 x 100Gb/s InfiniBand; port latency is under 90 ns with Mellanox EDR InfiniBand, and Spectrum is the world's lowest-latency Ethernet switch

No Packet Loss! Mellanox Spectrum switches have ZERO packet loss

Offloads – free up the CPU: stateless offloads, RDMA, GPUDirect™ RDMA, GPUDirect ASYNC, virtualization and container networking

In-Network Computing: SHARP

Storage Acceleration: RDMA, NVMe over Fabrics acceleration, erasure coding acceleration

Security: IPsec and TLS offloads

Resiliency: robust end-to-end network solutions

Page 8: Highest Performance and Scalability for HPC and AI


Data Centric Architecture to Overcome Latency Bottlenecks

CPU-Centric (Onload): communication latencies of 30-40 µs

Data-Centric (Offload) with In-Network Computing: communication latencies of 3-4 µs

Intelligent Interconnect Paves the Road to Exascale Performance

Page 9: Highest Performance and Scalability for HPC and AI


In-Network Computing – Delivers Highest Application Performance – 10X Performance Acceleration

Self-Healing Technology – Unbreakable Data Centers – 35X Faster Network Recovery

GPUDirect™ RDMA – GPU Acceleration Technology, Critical for HPC and Machine Learning Applications – 10X Performance Acceleration

In-Network Computing Delivers Highest Performance

Page 10: Highest Performance and Scalability for HPC and AI


30%-250% Higher Return on Investment

Up to 50% Saving on Capital and Operation Expenses

Highest Application Performance, Scalability and Productivity

InfiniBand Delivers Best Return on Investment

Molecular Dynamics: 1.9X Better | Chemistry: 2X Better | Automotive: 1.4X Better | Genomics: 2.5X Better | Weather: 1.3X Better

Page 12: Highest Performance and Scalability for HPC and AI


10X Higher Performance with GPUDirect™ RDMA Technology

GPUDirect™ RDMA

Purpose-built for Acceleration of Deep Learning

Lowest communication latency for acceleration devices

No unnecessary system memory copies and CPU overhead

Page 13: Highest Performance and Scalability for HPC and AI


GPUDirect™ ASYNC

GPUDirect RDMA (GPUDirect 3.0) – direct data path between the GPU and the Mellanox interconnect; the control path still uses the CPU

The CPU prepares and queues communication tasks on the GPU; the GPU triggers communication on the HCA; the Mellanox HCA directly accesses GPU memory

GPUDirect ASYNC (GPUDirect 4.0) – both the data path and the control path go directly between the GPU and the Mellanox interconnect

Maximum Performance for GPU Clusters
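
The benefit is visible from application code once the MPI library is CUDA-aware: the GPU buffer is handed to MPI directly and the HCA reads it over GPUDirect RDMA, with no staging copy through host memory. Below is a minimal sketch; the use of mpi4py and CuPy, and a CUDA-aware MPI build, are assumptions for illustration and are not named in the deck.

```python
# Minimal sketch: passing a GPU buffer directly to MPI.
# Assumes a CUDA-aware MPI build, mpi4py >= 3.1 and CuPy (illustrative choices).
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.arange(1 << 20, dtype=cp.float32)   # data lives in GPU memory
cp.cuda.Device().synchronize()               # ensure the GPU has finished writing

if rank == 0:
    comm.Send(buf, dest=1, tag=0)            # HCA reads GPU memory directly
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)          # received straight into GPU memory
```

With GPUDirect ASYNC the control path (queueing and triggering of the transfers) also moves off the CPU; that happens inside the communication stack rather than in application code like the above.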

Page 14: Highest Performance and Scalability for HPC and AI


Distributed Training

Training on large data sets can take a long time – in some cases, weeks

In many cases training needs to happen frequently: model development and tuning, and real-life use cases may require regular retraining

Accelerate training with a scale-out architecture: add workers (nodes) to reduce training time

Data Parallelism is the solution

The network is a critical element in accelerating distributed training!
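
To make the data-parallel pattern concrete, here is a toy sketch in which every worker computes gradients on its own data shard and an allreduce averages them each step. mpi4py, NumPy and the placeholder model are assumptions for the example, not something the deck prescribes.

```python
# Toy data-parallel training step: each worker computes local gradients,
# an Allreduce averages them, and every replica applies the same update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

weights = np.zeros(1000, dtype=np.float64)   # identical initial model on every rank
lr = 0.01

def local_gradient(w):
    # Placeholder for backprop over this worker's shard of the data set.
    return np.random.default_rng(comm.Get_rank()).standard_normal(w.shape)

for step in range(100):
    grad = local_gradient(weights)
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)    # the network-critical step each iteration
    weights -= lr * (avg / size)             # identical update on all replicas
```

The allreduce runs every iteration over the full gradient, which is why interconnect bandwidth and latency bound how far the training can scale out.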

Page 15: Highest Performance and Scalability for HPC and AI


RDMA Accelerates Distributed Training

Data parallelism communication: workers to parameter servers and parameter servers to workers; frequent; potentially high bandwidth; bursty; point-to-point and collectives

RDMA improves point-to-point and collective operations

Page 16: Highest Performance and Scalability for HPC and AI


Model Parallelism

Model size is limited by the compute engine (a GPU, for example)

In some cases the model doesn't fit in the compute engine: large models, or small compute devices such as FPGAs

Model parallelism slices the model and runs each part on a different compute engine

Networking becomes a critical element

High bandwidth, low latency and RDMA are mandatory

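To make the slicing concrete, here is a minimal two-rank sketch in which each rank owns one layer and the activations cross the network between them. The layer sizes and the use of mpi4py/NumPy are assumptions made purely for illustration.

```python
# Toy model parallelism: rank 0 owns layer 1, rank 1 owns layer 2.
# Activations travel over the interconnect between the stages, which is
# why bandwidth, latency and RDMA matter for this pattern.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
batch, hidden = 64, 4096

if rank == 0:
    w1 = np.random.rand(512, hidden).astype(np.float32)
    x = np.random.rand(batch, 512).astype(np.float32)
    act = np.maximum(x @ w1, 0)              # layer 1 forward pass (ReLU)
    comm.Send(act, dest=1, tag=0)            # ship activations to the next stage
elif rank == 1:
    w2 = np.random.rand(hidden, 10).astype(np.float32)
    act = np.empty((batch, hidden), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)
    logits = act @ w2                        # layer 2 forward pass
```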

Page 17: Highest Performance and Scalability for HPC and AI


Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): a reliable, scalable, general-purpose primitive

In-network tree-based aggregation mechanism: large number of groups, multiple simultaneous outstanding operations

Applicable to multiple use cases: HPC applications using MPI / SHMEM, distributed machine learning applications

Scalable high-performance collective offload: Barrier, Reduce, Allreduce, Broadcast and more; Sum, Min, Max, MinLoc, MaxLoc, OR, XOR, AND; integer and floating-point, 16/32/64/128 bits

SHARP tree: aggregation nodes and end nodes (processes running on the HCA), rooted at a SHARP tree root
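
The collectives SHARP offloads are the standard MPI primitives, so application code does not change. The sketch below issues a few of them with mpi4py; whether they are actually aggregated in the switches depends on the MPI stack having SHARP enabled (for example via Mellanox HPC-X settings), which is assumed here and sits outside the code.

```python
# Standard MPI collectives of the kind listed above (Barrier, Reduce,
# Allreduce, Bcast). With SHARP enabled in the MPI stack, these same calls
# can be reduced in-network instead of on the hosts. mpi4py is illustrative.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

comm.Barrier()                                        # synchronization

local = np.full(4, rank, dtype=np.float64)
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)              # sum across all ranks

peak = np.empty_like(local)
comm.Reduce(local, peak, op=MPI.MAX, root=0)          # max, result on rank 0

params = np.arange(4, dtype=np.float64) if rank == 0 else np.empty(4)
comm.Bcast(params, root=0)                            # broadcast from rank 0
```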

Page 18: Highest Performance and Scalability for HPC and AI


SHARP Allreduce Performance Advantages

SHARP Enables 75% Reduction in Latency

Providing Scalable Flat Latency

Page 19: Highest Performance and Scalability for HPC and AI


Performs the gradient averaging; removes the need for a physical parameter server; removes all parameter server overhead

Mellanox SHARP Technology Accelerates AI

The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes)

Page 20: Highest Performance and Scalability for HPC and AI


Mellanox Accelerates TensorFlow with RDMA

Unmatched Linear Scalability at No Additional Cost

50% Better Performance
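
For reference, TensorFlow 1.x exposed its RDMA (verbs) transport through the distributed server's protocol option. The fragment below is a sketch assuming a TensorFlow build compiled with verbs support; the cluster host names and ports are hypothetical.

```python
# Sketch: selecting the RDMA (verbs) transport in TensorFlow 1.x distributed
# training. Assumes a TensorFlow build with verbs support; node names and
# ports are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],
    "worker": ["node1:2222", "node2:2222"],
})

# "grpc+verbs" keeps gRPC for administrative messages while tensor
# transfers move onto RDMA.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
```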

Page 21: Highest Performance and Scalability for HPC and AI


Mellanox Accelerates TensorFlow (Advanced Verbs)

10GbE is not enough for large-scale models – 6.5X faster training with 100GbE (speedups of 2.5X and 6.5X shown)

Page 22: Highest Performance and Scalability for HPC and AI


Mellanox Accelerates NVIDIA NCCL 2.0

50% performance improvement with NVIDIA® DGX-1 across 32 NVIDIA Tesla V100 GPUs, using InfiniBand RDMA and GPUDirect™ RDMA
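
NCCL is normally driven through a deep-learning framework. As one hedged illustration (PyTorch is not mentioned in the deck), the sketch below runs an allreduce over the NCCL backend, which in turn can use InfiniBand RDMA and GPUDirect RDMA between nodes.

```python
# Illustrative only: driving NCCL through PyTorch's distributed backend.
# One process per GPU; MASTER_ADDR/MASTER_PORT, RANK and WORLD_SIZE are
# expected in the environment (set by the launcher).
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

grad = torch.randn(1 << 20, device="cuda")   # e.g. a flattened gradient bucket
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # NCCL allreduce across all GPUs
grad /= dist.get_world_size()                # average for data-parallel training
```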

Page 23: Highest Performance and Scalability for HPC and AI


Questions?