HPC Technologies Driving Advances in Machine Learning · 2016-09-26
TRANSCRIPT
Asaf Wachtel, Sr. Director BD, Mellanox
Bob Keating, Solutions Architect, NVIDIA
HPC for Wall Street, September 2016
HPC Technologies Driving Advances in Machine Learning
© 2016 Mellanox Technologies 2
Agenda
Introduction to Machine Learning & Deep Learning
GPUs in Machine Learning – Use Cases & Benefits
High Performance Interconnect in Machine Learning – Use Cases & Benefits
Roadmap – Where do we go from here?
THE WORLD LEADER IN VISUAL COMPUTING
Gaming | Enterprise | Auto | Data Center | Pro Visualization
THE AI RACE IS ON
2010-2015 timeline: Google Brain · ImageNet · IBM Watson wins Jeopardy · Theano · Caffe · Torch · NVIDIA cuDNN · Microsoft ML beats humans on ImageNet · Google Car passes 1M miles · Toyota $1B AI Lab · Facebook Big Sur · MS Azure ML · CNTK · Google TensorFlow · Amazon ML · IBM Watson · OpenAI
DEEP LEARNING DEMANDS NEW CLASS OF HPC
TRAINING (scalable performance):
• Billions of TFLOPS per training run
• Years of compute-days on a Xeon CPU
• GPU turns years into days
INFERENCING (throughput + efficiency for data / users):
• Billions of FLOPS per inference
• Seconds for a response on a Xeon CPU
• GPU for instant response
TESLA P100: New GPU Architecture to Enable the World's Fastest Compute Node
• Pascal Architecture: highest compute performance
• NVLink: GPU interconnect for maximum scalability
• CoWoS HBM2: unifying compute & memory in a single package
• Page Migration Engine: simple parallel programming with a virtually unlimited unified memory space shared between the CPU and Tesla P100
NVIDIA DGX-1: AI Supercomputer-in-a-Box
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U, 3200W
DGX STACK: Fully integrated Deep Learning platform
• Instant productivity: plug-and-play, supports every AI framework
• Performance optimized across the entire stack
• Always up-to-date via the cloud
• Mixed framework environments, containerized
• Direct access to NVIDIA experts
NVIDIA DIGITS: Interactive Deep Learning GPU Training System
Process Data | Configure DNN | Monitor Progress | Visualize Layers | Test Image
developer.nvidia.com/digits
TESLA END-TO-END DEEP LEARNING
TRAINING: Tesla P100, 65X in 3 years
INFERENCING: Tesla P4, 40X in 2 years
Training: compared to a Kepler GPU in 2013 using Caffe. Inference: img/sec/watt compared to a CPU (Intel E5-2697v4) using AlexNet.
NVIDIA DEEP LEARNING INSTITUTE
Access to self-study and instructor-led on-line courses and training materials.
Now with links to on-line training from Coursera, Microsoft, and Udacity.
On-line interactive courses provide a complete coding environment; ask for tokens (free).
developer.nvidia.com/deep-learning
High Performance Interconnect Usage for Machine Learning
Mellanox InfiniBand: Proven and Most Scalable HPC Interconnect
“Summit” System “Sierra” System
Paving the Road to Exascale
Deep Learning – Natural Fit for HPC Technology
“Training deep neural networks is very computationally intensive: training one of our models takes tens of exaflops of work, and so HPC techniques are key to creating these models.
Because the neural network training problem is so arithmetically intense, we rely on computationally dense processors like GPUs, and because we need to scale the training process over multiple nodes, we rely on fast interconnect technologies such as Infiniband. Along with HPC hardware, we also use HPC software such as MPI and BLAS libraries. Perhaps most importantly, we approach problems from an HPC point of view: we examine the fundamental limits to our computation, and then push to see how close we can get to those limits.”
Andrew Ng, Chief Scientist, Baidu @ International Supercomputing Conference June 2016
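To make the quote's scale concrete, here is a back-of-envelope sketch in Python; the total work and the sustained-throughput figures are illustrative assumptions, not numbers from the talk:

```python
# Rough time-to-train estimate for "tens of exaflops of work".
# All throughput figures below are assumed, order-of-magnitude values.
total_work_flops = 20e18   # assume 20 exaflops of total training work
cpu_sustained = 0.5e12     # assumed ~0.5 TFLOPS sustained on one CPU
gpu_sustained = 5e12       # assumed ~5 TFLOPS sustained on one GPU

seconds_per_day = 86_400
cpu_days = total_work_flops / cpu_sustained / seconds_per_day
gpu_days = total_work_flops / gpu_sustained / seconds_per_day

print(f"single CPU: {cpu_days:,.0f} days")  # over a year on one CPU
print(f"single GPU: {gpu_days:,.0f} days")  # still weeks, hence scaling the
                                            # training over many nodes with a
                                            # fast interconnect
```

Under these assumptions one CPU needs over a year and one GPU still needs weeks, which is why both dense processors and multi-node scaling over a fast interconnect appear in the quote.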
Evolution of GPUDirect RDMA
Before GPUDirect: network and third-party device drivers did not share buffers, and needed to make a redundant copy in host memory.
With GPUDirect Shared Host Memory Pages: the network and GPU can share "pinned" (page-locked) buffers, eliminating the need to make a redundant copy in host memory.
GPUDirect™ RDMA
With GPUDirect™ RDMA, using PeerDirect™:
• Eliminates CPU bandwidth and latency bottlenecks
• Uses remote direct memory access (RDMA) transfers between GPUs
• Results in significantly improved MPI_Send/MPI_Recv efficiency between GPUs in remote nodes
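The stages of GPUDirect differ mainly in how many redundant host-memory copies sit on the send path. A toy Python model of just that bookkeeping (the stage names and copy counts paraphrase the description above; this is an illustration, not driver code):

```python
# Host-memory copies needed to move a GPU buffer onto the wire,
# per the evolution described above.
HOST_COPIES = {
    "pre-GPUDirect": 2,      # separate GPU-driver and NIC-driver host buffers
    "shared host pages": 1,  # GPU and NIC share one pinned host buffer
    "GPUDirect RDMA": 0,     # NIC reads GPU memory directly over PCIe
}

def send_path(stage: str) -> list[str]:
    """Return the hops a buffer takes from GPU memory to the NIC."""
    hops = ["GPU memory"]
    hops += ["host memory"] * HOST_COPIES[stage]
    hops.append("NIC")
    return hops

for stage in HOST_COPIES:
    print(f"{stage:18s}: {' -> '.join(send_path(stage))}")
```

With RDMA the host-memory hop disappears entirely, which is where the CPU bandwidth and latency savings come from.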
Performance of MVAPICH2 with GPUDirect RDMA (source: Prof. DK Panda)
GPU-GPU internode MPI latency (lower is better): 88% lower latency, down to 2.37 usec (8.6X)
GPU-GPU internode MPI bandwidth (higher is better): 10X increase in throughput
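A quick consistency check of the latency figures: an 8.6X speedup down to 2.37 usec implies a pre-GPUDirect baseline of roughly 20 usec, which is where the ~88% reduction comes from (the baseline itself is inferred from the two stated numbers, not shown on the slide):

```python
# Relate the three latency figures: 2.37 usec final, 8.6X speedup, 88% lower.
rdma_latency_us = 2.37
speedup = 8.6
baseline_us = speedup * rdma_latency_us      # implied pre-GPUDirect latency
reduction = 1 - rdma_latency_us / baseline_us
print(f"implied baseline: {baseline_us:.1f} usec, reduction: {reduction:.0%}")
```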
Hadoop goes "Deep" with RDMA
• Yahoo has 600PB of data spread across 40,000 Hadoop nodes
• Enhancing deep learning nodes with multiple GPUs & InfiniBand interconnect
• Yahoo has open-sourced CaffeOnSpark (github.com/yahoo/CaffeOnSpark)
  - Multi-GPU support; MPI + RDMA support
• Multiple applications using the solution: image recognition, search, advertisement, fraud detection, etc.
Architecture: a Spark Driver coordinates Spark Executors (data feeding & control), each running enhanced Caffe with multi-GPU support in a node and a Model Synchronizer across nodes over RDMA with Mellanox InfiniBand; the dataset and trained model reside on HDFS.
Large Scale Distributed Deep Learning on Hadoop Clusters, Yahoo Big ML Team [link]
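A minimal sketch of what a model synchronizer across nodes does in data-parallel training: each worker computes gradients on its data shard, the gradients are averaged (an allreduce), and the same update is applied to every model replica. This pure-Python stand-in uses invented worker gradients and is not CaffeOnSpark code; in the real system the exchange runs over MPI with RDMA:

```python
# Data-parallel synchronization sketch: average per-worker gradients
# (an allreduce), then apply the identical update on every model replica.
def allreduce_mean(per_worker_grads: list[list[float]]) -> list[float]:
    n = len(per_worker_grads)
    return [sum(g[i] for g in per_worker_grads) / n
            for i in range(len(per_worker_grads[0]))]

def sgd_step(weights: list[float], grad: list[float],
             lr: float = 0.1) -> list[float]:
    return [w - lr * g for w, g in zip(weights, grad)]

weights = [1.0, -2.0]                          # identical replica on each worker
grads = [[0.2, 0.4], [0.4, 0.0], [0.0, 0.2]]   # one gradient per worker/shard
avg = allreduce_mean(grads)                    # every worker sees the same mean
weights = sgd_step(weights, avg)               # so replicas stay in lockstep
print(weights)
```

The averaging step is the part that crosses node boundaries every iteration, which is why its latency and bandwidth (the RDMA path) dominate scaling.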
Mellanox Interconnect Enables Baidu's Deep Image Supercomputer (Minwa)
• 4x higher resolution images
• Less than 6% application learning error rate
• Future use cases include driver-less cars
Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang and Gang Sun, Baidu Research: Deep Image: Scaling up Image Recognition [link]
Big Sur – An Open AI Platform from Facebook
• An OCP-based GP-GPU AI platform
• Open Rack v2 compatible, 4OU chassis
• Flexible architecture supporting up to 8 GPUs
• High-speed interconnect for scale
Use Cases:
• Text Processing
• Language Modeling
• Artificial Intelligence
• Computer Vision
https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/
NVIDIA DGX-1: WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER
• 170 TFLOPS
• 8x Tesla P100 16GB in NVLink Cube Mesh
• Optimized deep learning software
• Dual Xeon
• 7 TB SSD deep learning cache
• Dual 10GbE, Quad InfiniBand 100Gb
• 3RU, 3200W
• Mellanox ConnectX-4 + GPUDirect RDMA inside
Come Visit our Booth @ HPC on Wall Street: Highest-Performance 100Gb/s Interconnect Solutions
• Transceivers, active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s): VCSELs, silicon photonics and copper
• InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s, 7.02 billion msg/sec (195M msg/sec/port)
• Adapters: 100Gb/s, 0.6us latency, 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
• Ethernet switch: 32 100GbE ports, 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), throughput of 6.4Tb/s
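The switch throughput figures are self-consistent if each port is counted bidirectionally (100Gb/s each way); a quick Python check (the bidirectional counting is an assumption that makes the stated numbers line up):

```python
# Check the booth figures: port counts times per-port rates,
# counting both directions of each port.
ib_ports, gbps_per_direction = 36, 100
ib_tbps = ib_ports * 2 * gbps_per_direction / 1000  # aggregate throughput
ib_msgs = ib_ports * 195e6                          # 195M msg/sec per port

eth_ports = 32
eth_tbps = eth_ports * 2 * 100 / 1000               # 32 ports of 100GbE

print(ib_tbps, ib_msgs, eth_tbps)
```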
SEE THE FUTURE OF AI IN DC
Location: Ronald Reagan Building & International Trade Center, Washington D.C.
Event Date: October 26-27, 2016
GTC DC is a regional extension of the GTC event held annually in Silicon Valley. GTC DC attendees can train and connect with the brightest minds in computing on the hottest topics, including artificial intelligence and deep learning, virtual reality, and autonomous machines.
Registration opens August 18, 2016 at http://dc.gputechconf.com
Thank You