Jan 2018
TESLA PLATFORM
A NEW ERA OF COMPUTING
1995: PC INTERNET · WinTel, Yahoo! · 1 billion PC users
2005: MOBILE-CLOUD · iPhone, Amazon AWS · 2.5 billion mobile users
2015: AI & IOT · Deep Learning, GPU · 100s of billions of devices
NVIDIA: "THE AI COMPUTING COMPANY"
Artificial Intelligence · Computer Graphics · GPU Computing
RISE OF GPU COMPUTING
[Chart: performance vs. year, 1980-2020, log scale from 10^2 to 10^7]
GPU-computing performance: growing 1.5X per year, on track for 1000X by 2025
Single-threaded performance: once 1.5X per year, now 1.1X per year
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp
APPLICATIONS
SYSTEMS
ALGORITHMS
CUDA
ARCHITECTURE
ELEVEN YEARS OF GPU COMPUTING
2006: CUDA launched
2008: World's first GPU Top500 system
2010: Fermi, world's first HPC GPU
2012: AlexNet beats expert code by a huge margin using GPUs · Oak Ridge deploys world's fastest supercomputer with GPUs · Researchers discover how H1N1 mutates to resist drugs
2013: World's first atomic model of the HIV capsid
2014: Stanford builds AI machine using GPUs · World's first 3-D mapping of the human genome
2015: Google outperforms humans in ImageNet
2016: GPU-trained AI machine beats world champion in Go
2017: Top 13 greenest supercomputers powered by NVIDIA GPUs
TESLA PLATFORM: World's Leading Data Center Platform for Accelerating HPC and AI
- APPLICATIONS: internet services · enterprise applications (manufacturing, automotive, healthcare, finance, retail, defense, ...) · HPC (450+ applications)
- INDUSTRY FRAMEWORKS & TOOLS: deep learning frameworks · ecosystem tools
- NVIDIA SDK: Deep Learning SDK (cuDNN, TensorRT, DeepStream SDK, NCCL, cuBLAS, cuSPARSE) · ComputeWorks (CUDA C/C++, Fortran)
- TESLA GPU & SYSTEMS: Tesla GPU · NVIDIA DGX-1 · NVIDIA HGX-1 · system OEMs · cloud
500+ GPU-ACCELERATED APPLICATIONS
All Top 15 HPC Apps Accelerated
VASP
AMBER
NAMD
GROMACS
Gaussian
Simulia Abaqus
WRF
OpenFOAM
ANSYS
LS-DYNA
BLAST
LAMMPS
ANSYS Fluent
Quantum Espresso
GAMESS
14X GPU DEVELOPERS: 45,000 in 2012 to 615,000 in 2017
DEFINING THE NEXT GIANT WAVE IN HPC
OAK RIDGE SUMMIT
US’s next fastest supercomputer
200+ Petaflop HPC; 3+ Exaflop of AI
ABCI Supercomputer (AIST)
Japan’s fastest AI supercomputer
Piz Daint
Europe’s fastest supercomputer
MOST ADOPTED PLATFORM FOR ACCELERATING HPC
EVERY DEEP LEARNING FRAMEWORK ACCELERATED
25X COMPANIES ENGAGED: 1,500 in 2014 to 39,637 in 2017
AVAILABLE EVERYWHERE
Cloud Services
Systems
Desktops
MOST ADOPTED PLATFORM FOR ACCELERATING AI
TESLA PLATFORM FOR HPC
ARCHITECTING MODERN DATACENTERS: BIG INEFFICIENCIES WITH CPU NODES
[Chart: AMBER throughput (ns/day) vs. number of CPUs, comparing 1 node with 4x V100 GPUs against 48 CPU nodes of the Comet supercomputer]
AMBER simulation of CRISPR, nature's tool for genome editing: a single GPU server is 3.5x faster than the largest CPU configuration in the data center
AMBER 16 pre-release; CRISPR based on PDB ID 5f9r, 336,898 atoms; CPU: dual-socket Intel E5-2680v3, 12 cores, 128 GB DDR4 per node, FDR InfiniBand
WEAK NODES: lots of nodes interconnected with vast network overhead (network fabric, server racks)
STRONG NODES: few lightning-fast nodes with the performance of hundreds of weak nodes
ARCHITECTING MODERN DATACENTERS
- Strong-core CPU for sequential code
- Volta: 5,120 CUDA cores
- 125 TFLOPS Tensor Core
- NVLink for strong scaling
70% OF THE WORLD'S SUPERCOMPUTING WORKLOAD ACCELERATED
Intersect360 Research, Nov 2017 “HPC Application Support for GPU Computing”
Top 15 HPC Applications · 500+ Accelerated Applications
GPU-ACCELERATED HPC APPLICATIONS: 500+ APPLICATIONS
- MFG, CAD & CAE (111 apps), including ANSYS Fluent, SIMULIA Abaqus, AutoCAD, CST Studio Suite
- Life Sciences (50+ apps), including Gaussian, VASP, AMBER, HOOMD-blue, GAMESS
- Data Science & Analytics (23 apps), including MapD, Kinetica, Graphistry
- Deep Learning (32 apps), including Caffe2, MXNet, TensorFlow
- Media & Entertainment (142 apps), including DaVinci Resolve, Premiere Pro CC, Redshift Renderer
- Physics (20 apps), including QUDA, MILC, GTC-P
- Oil & Gas (17 apps), including RTM, SPECFEM3D
- Safety & Security (15 apps), including Cylance, FaceControl, Syndex Pro
- Tools & Management (15 apps), including Bright Cluster Manager, HPCToolkit, Vampir
- Federal & Defense (13 apps), including ArcGIS Pro, ENVI, SocetGXP
- Climate & Weather (4 apps), including COSMO, Gales, WRF
- Computational Finance (16 apps), including O-Quant Options Pricing, MUREX, MISYS
DEEP LEARNING COMES TO HPC
Workflow: SIMULATION (FP64/FP32) feeds a TRAINING SET for TRAINING (FP32/FP16); trained models run INFERENCE (FP16/INT8) against a REGRESSION SET for REGRESSION TESTING (FP16/INT8); errors and new data flow back into training.
AI ACCELERATES SCIENTIFIC DISCOVERY
- UIUC & NCSA, astrophysics: 5,000X faster LIGO signal processing
- U. Florida & UNC, drug discovery: 300,000X faster molecular energetics prediction
- SLAC, astrophysics: gravitational lensing analysis cut from weeks to 10 ms
- US DoE, particle physics: 33% more accurate neutrino detection
- Princeton & ITER, clean energy: 50% higher accuracy for fusion sustainment
- U. Pitt, drug discovery: 35% higher accuracy for protein scoring
ONE PLATFORM BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE
The Tesla platform accelerates both AI and HPC on a single foundation: CUDA
DRAMATICALLY MORE FOR YOUR MONEY
Equal throughput with fewer racks (~30 kW per rack): 1 GPU-accelerated rack ($0.8M, 36 CPUs + 72 V100s) matches:
- VASP: 5 CPU racks ($2.0M, 360 CPUs)
- RTM: 14 CPU racks ($6.0M, 1,152 CPUs)
- ResNet-50 (DL training): 22 CPU racks ($9.2M, 1,764 CPUs)
Budget: smaller and more efficient. In a traditional data center only 39% of the budget goes to compute servers; the other 61% is non-compute (racks, cabling, infrastructure, networking). In a GPU-accelerated data center, compute servers take 85% and non-compute just 15%.
Save up to $8M with each GPU-accelerated rack.
Source: traditional data center cost model by Microsoft Research on data center costs
DATA CENTER SAVINGS FOR MIXED WORKLOADS: 5X Better HPC TCO for the Same Throughput
Mixed workload: materials science (VASP), life sciences (AMBER), physics (MILC), deep learning (ResNet-50)
160 self-hosted servers at 96 kW vs. 12 accelerated servers with 4x V100 GPUs each at 20 kW: same throughput at 1/3 the cost, 1/4 the space, and 1/5 the power
TESLA V100: The Fastest and Most Productive GPU for AI and HPC
- Volta architecture: most productive GPU
- Tensor Core: 125 programmable TFLOPS for deep learning
- Improved SIMT model: enables new algorithms
- Volta MPS: improved inference utilization
- Improved NVLink & HBM2: efficient bandwidth
VOLTA TO FUEL SUMMIT: Next Milestone in AI Supercomputing
- AI exascale today: 3+ EFLOPS of Tensor ops
- Performance leadership: 200 PF vs. Titan's 20 PF; 5-10X application performance over Titan
- Accelerated science: ACME, DIRAC, FLASH, GTC, HACC, LSDALTON, NAMD, NUCCOR, NWCHEM, QMCPACK, RAPTOR, SPECFEM, XGC
BREAKTHROUGH EFFICIENCY ON THE PATH TO EXASCALE
[Chart: GFLOPS per watt for top GPU systems in the Green500 list, measured performance plus NVIDIA projections for V100; exascale goal: 33 GF/W]
- Eurotech Aurora (K20): 3.2 GF/W
- Tsubame-KFC (K20X): 4.4 GF/W
- Tsubame-KFC (K80): 5.3 GF/W
- SaturnV (P100): 9.5 GF/W
- Tsubame 3 (P100): 14.1 GF/W
- V100: ahead of the curve
13/13 greenest supercomputers powered by Tesla P100: TSUBAME 3.0, Kukai, AIST AI Cloud, RAIDEN GPU subsystem, Piz Daint, Wilkes-2, GOSAT-2 (RCF2), DGX SaturnV, Reedbush-H, JADE, Facebook Cluster, Cedar, DAVIDE
POWER OF GPU COMPUTING PLATFORM: Delivered Value Grows Over Time
AMBER performance (ns/day, Cellulose NVE dataset) rises with each generation:
- K20 (2013): AMBER 12, CUDA 4
- K40 (2014): AMBER 14, CUDA 4
- K80 (2015): AMBER 14, CUDA 6
- P100 (2016): AMBER 16, CUDA 8
- V100 (2017): AMBER 16, CUDA 9
GoogLeNet training performance (images/sec, ImageNet dataset) likewise:
- 8x K80 (2014): cuDNN 2, CUDA 6
- 8x Maxwell (2015): cuDNN 4, CUDA 7
- DGX-1 (2016): cuDNN 6, CUDA 8, NCCL 1.6
- DGX-1V (2017): cuDNN 7, CUDA 9, NCCL 2
TESLA PLATFORM FOR AI
AI REVOLUTIONIZING OUR WORLD
- Internet services: search, assistants, translation, recommendations, shopping, photos...
- Medicine: detect, diagnose, and treat diseases
- Industry: powering breakthroughs in agriculture, manufacturing, EDA
NEURAL NETWORK COMPLEXITY IS EXPLODING
Networks are getting bigger and more compute intensive:
- Image (GOP x bandwidth), 2011-2017: AlexNet, GoogleNet, Inception-v2, ResNet-50, Inception-v4; 350X growth
- Speech (GOP x bandwidth), 2013-2018: DeepSpeech, DeepSpeech 2, DeepSpeech 3; 30X growth
- Translation (GOP x bandwidth), 2014-2018: OpenNMT, GNMT, MoE; 10X growth
PLATFORM BUILT FOR AI: Delivering 125 TFLOPS of DL Performance with Volta
- Volta Tensor Core: a 4x4 matrix processing array computing D[FP32] = A[FP16] * B[FP16] + C[FP32], optimized for deep learning
- Volta-optimized cuDNN: dense-matrix data optimization for Tensor Core compute, plus FP32-to-tensor-op data conversion for frameworks
- Supported by all major frameworks
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
TRAINING (datacenter): billions of trillions of operations; GPUs train larger models and accelerate time to market
REVOLUTIONARY AI PERFORMANCE
3X faster DL training: time to train a neural machine translation model for 13 epochs (German to English, WMT15 subset):
- 2x CPU (Xeon E5-2699 v4): 15 days
- 1x P100: 18 hours
- 1x V100: 6 hours (a 3X reduction over P100)
Over 80X DL training performance in 3 years (GoogleNet training speedup vs. 1x K80 with cuDNN 2):
- Q1 2015: 1x K80, cuDNN 2
- Q3 2015: 4x M40, cuDNN 3
- Q2 2016: 8x P100, cuDNN 6
- Q2 2017: 8x V100, cuDNN 7
NVIDIA GPUS POWER WORLD'S FASTEST DEEP LEARNING PERFORMANCE
Time to train ResNet-50 (ImageNet, 90 epochs):
- Facebook, June 2017: 256 Tesla P100, 60 minutes
- IBM, August 2017: 256 Tesla P100, 48 minutes
- Preferred Networks, November 2017: 1,024 Tesla P100, 15 minutes
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
DATACENTER INFERENCING: tens of billions of image, voice, and video queries per day; GPU inference delivers fast response and maximizes datacenter throughput
NVIDIA TENSORRT: PROGRAMMABLE INFERENCE ACCELERATOR
TensorRT deploys trained networks across Tesla V100, Tesla P4, DRIVE PX 2, Jetson TX2, and NVIDIA DLA
NVIDIA TENSORRT 3: World's Fastest Inference Platform
ResNet-50 image throughput (images/sec at a 7 ms target latency):
- CPU + TensorFlow: 14 ms latency
- V100 + TensorFlow: 7 ms latency
- V100 + TensorRT: 7 ms latency at the highest throughput
OpenNMT translation throughput (sentences/sec at a 200 ms target latency):
- CPU + Torch: 280 ms latency
- V100 + Torch: 153 ms latency
- V100 + TensorRT: 117 ms latency at the highest throughput
NVIDIA PLATFORM SAVES DATA CENTER COSTS: Game-Changing Inference Performance
Inference workload: image recognition using ResNet-50
160 CPU servers at 65 kW vs. 1 HGX server at 3 kW, both delivering 45,000 images/sec: same throughput at 1/4 the space and 1/22 the power
GPU-ACCELERATED INFERENCE
- iFLYTEK: speech recognition
- Valossa: video intelligence
- Microsoft Bing: visual search
TESLA PRODUCT FAMILY
END-TO-END PRODUCT FAMILY
- HYPERSCALE HPC: deep learning training & inference; Tesla V100 for training & inference, Tesla P4 for the most efficient inference & transcoding
- STRONG-SCALE HPC: HPC and DL workloads scaling to multiple GPUs; Tesla V100 with NVLink
- MIXED-APPS HPC: HPC workloads with a mix of CPU and GPU work; Tesla V100 with PCIe
- FULLY INTEGRATED SUPERCOMPUTER: DGX-1 server and DGX Station, fully integrated deep learning solutions
OPTIMIZED FOR DATACENTER EFFICIENCY
ResNet-50 training in a 13 kW rack:
- V100 @ MAXP (max performance): 4 nodes of 8x V100, 1X rack throughput
- V100 @ MAXQ (max efficiency): 75% of the performance at half the power, so the same rack holds 7 nodes of 8x V100 for 1.3X rack throughput, i.e. 30% more performance in a rack
[Chart: DL perf and DL perf/watt vs. power (watts), marking the max-performance and max-efficiency operating points]
TESLA V100

              For NVLink Servers                    For PCIe Servers
Core          5,120 CUDA cores, 640 Tensor cores    5,120 CUDA cores, 640 Tensor cores
Compute       7.8 TF DP · 15.7 TF SP · 125 TF DL    7 TF DP · 14 TF SP · 112 TF DL
Memory        HBM2: 900 GB/s, 16 GB                 HBM2: 900 GB/s, 16 GB
Interconnect  NVLink (up to 300 GB/s) +             PCIe Gen3 (up to 32 GB/s)
              PCIe Gen3 (up to 32 GB/s)
Power         300 W                                 250 W
Available     Now                                   Now
TESLA PLATFORM FOR CLOUD PROVIDERS
CLOUD GPU DEMAND OUTSTRIPS SUPPLY
Q3 2016, AWS launches P2 instance: "P2 instance is one of the fastest growing instance in AWS history." (Andrew Jassy, AWS CEO, re:Invent 2016)
Q4 2016, Azure launches N-Series preview: "We've had thousands of customers participate in the N-Series preview since we launched it back in August." (Corey Sanders, Director of Compute, Azure)
GLOBAL CSP OFFERINGS

Compute
- AWS: P3, up to 8x V100 SXM2 (available only in N. Virginia, Oregon, Ireland, and Tokyo); P2, up to 8x K80 physical cards. https://aws.amazon.com/ec2/instance-types/p3/ · https://aws.amazon.com/ec2/instance-types/p2/
- Google Cloud: GPU servers with up to 4x K80; up to 4x P100 PCIe in public beta. https://cloud.google.com/gpu/
- IBM Cloud: GPU servers with up to 2x K80 or 1x P100 PCIe (bare-metal). https://www.ibm.com/cloud-computing/bluemix/gpu-computing
- Azure: NC series, up to 2x K80; NCv2 & ND series, up to 4x P100 PCIe or 4x P40 (available only in the US West 2 region). https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/#n-series
- Oracle Cloud: X7 shape, up to 2x P100, bare-metal and VM (available only in the Ashburn region; Frankfurt to come in Jan 2018). https://cloud.oracle.com/infrastructure/compute

Virtual workstation
- AWS: G3, M60
- Google Cloud: P100 PCIe GPU servers; vWS private alpha available, public beta in Jan 2018
- IBM Cloud: GPU servers with up to 2x M60 or 2x M10
- Azure: M60 GPU servers

Virtual PC
- AWS: up to 4x K520 physical cards
- Google Cloud: M60 GPU servers
- IBM Cloud: M10 GPU servers; VMware Horizon Air vPC launching in January
NVIDIA GPU CLOUD: AI and HPC Everywhere, For Everyone
NVIDIA GPU Cloud integrates GPU-optimized deep learning frameworks, HPC apps, runtimes, libraries, and OS into a ready-to-run container, available at no charge.
- Innovate in minutes, not weeks: removes all the DIY complexity of DL and HPC software integration
- Cross-platform: containers run locally on DGX systems and TITAN PCs, or on cloud service provider GPU instances
- Always up to date: monthly updates by NVIDIA ensure maximum performance
NVIDIA GPU CLOUD: SIMPLIFYING AI & HPC
Deep Learning · HPC Apps · HPC Viz
NGC GPU-OPTIMIZED DEEP LEARNING CONTAINERS
A comprehensive catalog of deep learning software: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), DIGITS, MXNet, PyTorch, TensorFlow, Theano, Torch, and CUDA (the base-level container for developers). New: NVIDIA TensorRT inference accelerator with ONNX support.
HPC APPS COMING TO NVIDIA GPU CLOUD
NVIDIA GPU CLOUD FOR HPC VISUALIZATION
Unified visualization for large data sets: large-scale volumetric rendering, physically accurate ray tracing, production-quality images, and seamless integration with ParaView
- ParaView with NVIDIA OptiX
- ParaView with NVIDIA Holodeck
- ParaView with NVIDIA IndeX
Early access now: sign up at nvidia.com/gpu-cloud
TESLA PLATFORM FOR DEVELOPERS
HOW GPU ACCELERATION WORKS
Application code is split: the roughly 5% of code in compute-intensive functions runs on the GPU, while the rest of the sequential code stays on the CPU.
GPU-ACCELERATED LIBRARIES: "Drop-in" Acceleration for Your Applications
- Deep learning: cuDNN, TensorRT, DeepStream SDK
- Linear algebra: cuBLAS, cuSPARSE, cuSOLVER, cuRAND, CUDA math library
- Parallel algorithms: nvGRAPH, NCCL
- Signal, image & video: cuFFT, NVIDIA NPP, Codec SDK
CUDA TOOLKIT 9
- Unleashes the power of Volta: optimized for Tensor Cores, second-generation NVLink, and HBM2 stacked memory
- Cooperative thread groups: flexible thread groups and efficient parallel algorithms, with synchronization across thread blocks in a single GPU or across multiple GPUs
- Faster libraries: GEMM optimizations for RNNs (cuBLAS), >20x faster image processing (NPP), FFT optimizations across various sizes (cuFFT)
- Developer tools & platform updates: 1.3x faster compiling, new OS and compiler support, unified memory profiling, NVLink visualization
WHAT IS OPENACC
OpenACC is a directives-based programming approach to parallel computing, designed for performance and portability on CPUs and accelerators for HPC (OpenPOWER, Sunway, x86 CPU & Xeon Phi, NVIDIA GPU, PEZY-SC).
Add a simple compiler directive:

    main() {
        <serial code>
        #pragma acc kernels
        {
            <parallel code>
        }
    }

Read more at www.openacc.org
OPENACC: EASY ONBOARD TO GPU COMPUTING
A widely adopted directives model for parallel programming. Simple. Powerful. Portable.
Targets: POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, AMD, PEZY-SC
[Chart: AWE Hydrodynamics CloverLeaf mini-app (bm32 data set), speedup vs. a single Haswell core: roughly 10-11x from PGI OpenACC and Intel/IBM OpenMP on multicore Broadwell and POWER8, rising to 77x, 120x, and 158x with OpenACC on GPUs up to Volta V100]
Adopted by key HPC codes:
- 5 CAAR codes: GTC, XGC, ACME, FLASH, LSDalton
- 3 of the top 5 HPC apps: ANSYS Fluent, VASP, Gaussian
- 2017 Gordon Bell finalist: CAM-SE on TaihuLight
OPENACC SUCCESS STORIES
- LSDalton (quantum chemistry): 12X speedup in 1 week
- NUMECA (CFD): 10X faster kernels, 2X faster app
- PowerGrid (medical imaging): 40 days to 2 hours
- INCOMP3D (CFD): 3X speedup
- NekCEM (computational electromagnetics): 2.5X speedup, 60% less energy
- COSMO (climate & weather): 40X speedup, 3X energy efficiency
- CloverLeaf (CFD): 4X speedup from a single CPU/GPU code
- MAESTRO & CASTRO (astrophysics): 4.4X speedup for 4 weeks of effort
OPENACC RESOURCES
Guides · Talks · Tutorials · Videos · Books · Spec · Code Samples · Teaching Materials · Events · Success Stories · Courses · Slack · Stack Overflow
- Resources: https://www.openacc.org/resources
- Success stories: https://www.openacc.org/success-stories
- Events: https://www.openacc.org/events
- Compilers and tools (free compilers available): https://www.openacc.org/tools
NVIDIA DEEP LEARNING SDK
High-performance GPU acceleration for deep learning:
- Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications
- High-performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
- Industry-vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
- Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs
"We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver." (Frédéric Bastien, Team Lead (Theano), MILA)
developer.nvidia.com/deep-learning-software
NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs:
- Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
- Easy to integrate and MPI-compatible; automatic topology detection scales HPC and deep learning applications over NVLink and PCIe within a node and InfiniBand verbs or IP sockets across nodes
- Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch, and more
Near-linear multi-node scaling with NCCL 2 (Microsoft Cognitive Toolkit, ResNet-50, FP32, batch size 64, NVIDIA DGX-1 with cuDNN 6), images/sec:
- 1 GPU: 217
- 4 GPUs: 843
- 8 GPUs: 1,685
- 16 GPUs: 3,281
- 32 GPUs: 6,570
developer.nvidia.com/nccl
NVIDIA DIGITS: Interactive Deep Learning GPU Training System
Interactive deep learning training application for engineers and data scientists:
- Simplifies deep neural network training with an interactive interface to train, validate, and visualize results
- Built-in workflows for image classification, object detection, and image segmentation
- Improves model accuracy with pre-trained models from the DIGITS Model Store
- Faster time to solution with multi-GPU acceleration
developer.nvidia.com/digits
NVIDIA cuDNN: Deep Learning Primitives
High-performance building blocks for deep learning frameworks:
- Drop-in acceleration for widely used frameworks such as Caffe2, Microsoft Cognitive Toolkit, PyTorch, TensorFlow, Theano, and others
- Accelerates industry-vetted algorithms such as convolutions, LSTM RNNs, fully connected, and pooling layers
- Fast deep learning training performance tuned for NVIDIA GPUs
"NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time." (Evan Shelhamer, Lead Caffe Developer, UC Berkeley)
[Chart: deep learning training performance (images/sec) rising from 8x K80 (cuDNN 2) through 8x Maxwell (cuDNN 4) and DGX-1 (cuDNN 6, NCCL 1.6) to DGX-1V (cuDNN 7, NCCL 2)]
developer.nvidia.com/cudnn
NVIDIA TensorRT 3: Programmable Inference Accelerator
A compiler for optimized neural networks: TensorRT takes a trained neural network and produces a compiled, optimized network via:
- Weight & activation precision calibration
- Layer & tensor fusion
- Kernel auto-tuning
- Dynamic tensor memory
- Multi-stream execution