HPC Technologies Driving Advances in Machine Learning · 2016-09-26
TRANSCRIPT
Asaf Wachtel, Sr. Director BD, Mellanox
Bob Keating, Solutions Architect, NVIDIA
HPC for Wall Street, September 2016
HPC Technologies Driving Advances in Machine Learning
© 2016 Mellanox Technologies 2
Agenda
Introduction to Machine Learning & Deep Learning
GPUs in Machine Learning – Use Cases & Benefits
High Performance Interconnect in Machine Learning – Use Cases & Benefits
Roadmap – Where do we go from here?
THE WORLD LEADER IN VISUAL COMPUTING
Gaming | Enterprise | Auto | Data Center | Pro Visualization
THE AI RACE IS ON
2010-2015 timeline: Google Brain · ImageNet · IBM Watson wins Jeopardy · Theano · Caffe · Torch · NVIDIA cuDNN · Microsoft ML beats humans on ImageNet · Google Car passes 1M miles · Toyota $1B AI Lab · Facebook Big Sur · MS Azure ML · CNTK · Google TensorFlow · Amazon ML · IBM Watson · OpenAI
DEEP LEARNING DEMANDS NEW CLASS OF HPC
TRAINING (scalable performance):
• Billions of TFLOPS per training run
• Years of compute-days on a Xeon CPU
• GPU turns years into days
INFERENCING (throughput + efficiency for data / users):
• Billions of FLOPS per inference
• Seconds for a response on a Xeon CPU
• GPU for instant response
TESLA P100: New GPU Architecture to Enable the World's Fastest Compute Node
• Pascal Architecture: highest compute performance
• NVLink: GPU interconnect for maximum scalability
• CoWoS HBM2: unifying compute & memory in a single package
• Page Migration Engine: simple parallel programming with a virtually unlimited unified memory space shared between the CPU and Tesla P100
NVIDIA DGX-1: AI Supercomputer-in-a-Box
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U, 3200W
DGX STACK: Fully integrated Deep Learning platform
• Instant productivity: plug-and-play, supports every AI framework
• Performance optimized across the entire stack
• Always up-to-date via the cloud
• Mixed framework environments, containerized
• Direct access to NVIDIA experts
NVIDIA DIGITS: Interactive Deep Learning GPU Training System
Process Data | Configure DNN | Monitor Progress | Visualize Layers | Test Image
developer.nvidia.com/digits
TESLA END-TO-END DEEP LEARNING
TRAINING: Tesla P100, 65X in 3 years
INFERENCING: Tesla P4, 40X in 2 years
Training: compared to a Kepler GPU in 2013 using Caffe. Inference: img/sec/watt compared to a CPU (Intel E5-2697v4) using AlexNet.
NVIDIA DEEP LEARNING INSTITUTE
Access to self-study and instructor-led on-line courses and training materials.
Now with links to on-line training from Coursera, Microsoft, and Udacity.
On-line interactive courses provide a complete coding environment; ask for tokens (free).
developer.nvidia.com/deep-learning
High Performance Interconnect Usage for Machine Learning
Mellanox InfiniBand: Proven and Most Scalable HPC Interconnect
“Summit” System “Sierra” System
Paving the Road to Exascale
Deep Learning – Natural Fit for HPC Technology
“Training deep neural networks is very computationally intensive: training one of our models takes tens of exaflops of work, and so HPC techniques are key to creating these models.
Because the neural network training problem is so arithmetically intense, we rely on computationally dense processors like GPUs, and because we need to scale the training process over multiple nodes, we rely on fast interconnect technologies such as Infiniband. Along with HPC hardware, we also use HPC software such as MPI and BLAS libraries. Perhaps most importantly, we approach problems from an HPC point of view: we examine the fundamental limits to our computation, and then push to see how close we can get to those limits.”
Andrew Ng, Chief Scientist, Baidu @ International Supercomputing Conference June 2016
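To make the quote's scale concrete, here is a back-of-envelope sketch in Python; the total work and the sustained-throughput figures are illustrative assumptions, not numbers from the talk:

```python
# Rough time-to-train estimate for "tens of exaflops of work".
# All throughput figures below are assumed, order-of-magnitude values.
total_work_flops = 20e18   # assume 20 exaflops of total training work
cpu_sustained = 0.5e12     # assumed ~0.5 TFLOPS sustained on one CPU
gpu_sustained = 5e12       # assumed ~5 TFLOPS sustained on one GPU

seconds_per_day = 86_400
cpu_days = total_work_flops / cpu_sustained / seconds_per_day
gpu_days = total_work_flops / gpu_sustained / seconds_per_day

print(f"single CPU: {cpu_days:,.0f} days")  # over a year on one CPU
print(f"single GPU: {gpu_days:,.0f} days")  # still weeks, hence scaling the
                                            # training over many nodes with a
                                            # fast interconnect
```

Under these assumptions one CPU needs over a year and one GPU still needs weeks, which is why both dense processors and multi-node scaling over a fast interconnect appear in the quote.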
Evolution of GPUDirect RDMA
Before GPUDirect: network and third-party device drivers did not share buffers, and needed to make a redundant copy in host memory.
With GPUDirect Shared Host Memory Pages: the network and GPU can share "pinned" (page-locked) buffers, eliminating the need to make a redundant copy in host memory.
GPUDirect™ RDMA
With GPUDirect™ RDMA, using PeerDirect™:
• Eliminates CPU bandwidth and latency bottlenecks
• Uses remote direct memory access (RDMA) transfers between GPUs
• Results in significantly improved MPI_Send/MPI_Recv efficiency between GPUs in remote nodes
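The stages of GPUDirect differ mainly in how many redundant host-memory copies sit on the send path. A toy Python model of just that bookkeeping (the stage names and copy counts paraphrase the description above; this is an illustration, not driver code):

```python
# Host-memory copies needed to move a GPU buffer onto the wire,
# per the evolution described above.
HOST_COPIES = {
    "pre-GPUDirect": 2,      # separate GPU-driver and NIC-driver host buffers
    "shared host pages": 1,  # GPU and NIC share one pinned host buffer
    "GPUDirect RDMA": 0,     # NIC reads GPU memory directly over PCIe
}

def send_path(stage: str) -> list[str]:
    """Return the hops a buffer takes from GPU memory to the NIC."""
    hops = ["GPU memory"]
    hops += ["host memory"] * HOST_COPIES[stage]
    hops.append("NIC")
    return hops

for stage in HOST_COPIES:
    print(f"{stage:18s}: {' -> '.join(send_path(stage))}")
```

With RDMA the host-memory hop disappears entirely, which is where the CPU bandwidth and latency savings come from.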
Performance of MVAPICH2 with GPUDirect RDMA (source: Prof. DK Panda)
GPU-GPU internode MPI latency (lower is better): 88% lower latency, down to 2.37 usec (8.6X)
GPU-GPU internode MPI bandwidth (higher is better): 10X increase in throughput
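A quick consistency check of the latency figures: an 8.6X speedup down to 2.37 usec implies a pre-GPUDirect baseline of roughly 20 usec, which is where the ~88% reduction comes from (the baseline itself is inferred from the two stated numbers, not shown on the slide):

```python
# Relate the three latency figures: 2.37 usec final, 8.6X speedup, 88% lower.
rdma_latency_us = 2.37
speedup = 8.6
baseline_us = speedup * rdma_latency_us      # implied pre-GPUDirect latency
reduction = 1 - rdma_latency_us / baseline_us
print(f"implied baseline: {baseline_us:.1f} usec, reduction: {reduction:.0%}")
```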
Hadoop goes "Deep" with RDMA
• Yahoo has 600PB of data spread across 40,000 Hadoop nodes
• Enhancing deep learning nodes with multiple GPUs & InfiniBand interconnect
• Yahoo has open-sourced CaffeOnSpark (github.com/yahoo/CaffeOnSpark)
  - Multi-GPU support; MPI + RDMA support
• Multiple applications using the solution: image recognition, search, advertisement, fraud detection, etc.
Architecture: a Spark Driver coordinates Spark Executors (data feeding & control), each running enhanced Caffe with multi-GPU support in a node and a Model Synchronizer across nodes over RDMA with Mellanox InfiniBand; the dataset and trained model reside on HDFS.
Large Scale Distributed Deep Learning on Hadoop Clusters, Yahoo Big ML Team [link]
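A minimal sketch of what a model synchronizer across nodes does in data-parallel training: each worker computes gradients on its data shard, the gradients are averaged (an allreduce), and the same update is applied to every model replica. This pure-Python stand-in uses invented worker gradients and is not CaffeOnSpark code; in the real system the exchange runs over MPI with RDMA:

```python
# Data-parallel synchronization sketch: average per-worker gradients
# (an allreduce), then apply the identical update on every model replica.
def allreduce_mean(per_worker_grads: list[list[float]]) -> list[float]:
    n = len(per_worker_grads)
    return [sum(g[i] for g in per_worker_grads) / n
            for i in range(len(per_worker_grads[0]))]

def sgd_step(weights: list[float], grad: list[float],
             lr: float = 0.1) -> list[float]:
    return [w - lr * g for w, g in zip(weights, grad)]

weights = [1.0, -2.0]                          # identical replica on each worker
grads = [[0.2, 0.4], [0.4, 0.0], [0.0, 0.2]]   # one gradient per worker/shard
avg = allreduce_mean(grads)                    # every worker sees the same mean
weights = sgd_step(weights, avg)               # so replicas stay in lockstep
print(weights)
```

The averaging step is the part that crosses node boundaries every iteration, which is why its latency and bandwidth (the RDMA path) dominate scaling.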
Mellanox Interconnect Enables Baidu's Deep Image Supercomputer (Minwa)
• 4x higher resolution images
• Less than 6% application learning error rate
• Future use cases include driver-less cars
Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang and Gang Sun, Baidu Research: Deep Image: Scaling up Image Recognition [link]
Big Sur – An Open AI Platform from Facebook
• An OCP-based GP-GPU AI platform
• Open Rack v2 compatible, 4OU chassis
• Flexible architecture supporting up to 8 GPUs
• High-speed interconnect for scale
Use Cases:
• Text Processing
• Language Modeling
• Artificial Intelligence
• Computer Vision
https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/
NVIDIA DGX-1: WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER
• 170 TFLOPS
• 8x Tesla P100 16GB in NVLink Cube Mesh
• Optimized deep learning software
• Dual Xeon
• 7 TB SSD deep learning cache
• Dual 10GbE, Quad InfiniBand 100Gb
• 3RU, 3200W
• Mellanox ConnectX-4 + GPUDirect RDMA inside
Come Visit our Booth @ HPC on Wall Street: Highest-Performance 100Gb/s Interconnect Solutions
• Transceivers, active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s): VCSELs, silicon photonics and copper
• InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s, 7.02 billion msg/sec (195M msg/sec/port)
• Adapters: 100Gb/s, 0.6us latency, 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
• Ethernet switch: 32 100GbE ports, 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), throughput of 6.4Tb/s
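The switch throughput figures are self-consistent if each port is counted bidirectionally (100Gb/s each way); a quick Python check (the bidirectional counting is an assumption that makes the stated numbers line up):

```python
# Check the booth figures: port counts times per-port rates,
# counting both directions of each port.
ib_ports, gbps_per_direction = 36, 100
ib_tbps = ib_ports * 2 * gbps_per_direction / 1000  # aggregate throughput
ib_msgs = ib_ports * 195e6                          # 195M msg/sec per port

eth_ports = 32
eth_tbps = eth_ports * 2 * 100 / 1000               # 32 ports of 100GbE

print(ib_tbps, ib_msgs, eth_tbps)
```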
SEE THE FUTURE OF AI IN DC
Location: Ronald Reagan Building & International Trade Center, Washington D.C.
Event Date: October 26-27, 2016
GTC DC is a regional extension of the GTC event held annually in Silicon Valley. GTC DC attendees can train and connect with the brightest minds in computing on the hottest topics, including artificial intelligence and deep learning, virtual reality, and autonomous machines.
Registration opens August 18, 2016 at http://dc.gputechconf.com
Thank You