![Page 2: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/2.jpg)
Contents
- GPU computing today
- NVIDIA deep learning training platform
- NVIDIA online services platform
- Tesla GPU products and roadmap
![Page 3: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/3.jpg)
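CPUs are optimized for serial tasks
GPU accelerators are optimized for parallel tasks
Accelerated computing: 10x performance and 5x energy efficiency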
![Page 4: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/4.jpg)
10x growth in accelerated computing

| Metric | 2008 | 2015 |
| --- | --- | --- |
| CUDA downloads | 150,000 | 3 million |
| Academic papers | 4,000 | 60,000 |
| Universities teaching | 60 | 800 |
| Supercomputing teraflops | 77 | 54,000 |
| Tesla GPUs | 6,000 | 450,000 |
| CUDA apps | 27 | 334 |
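As a quick check of the "10x" headline, the following sketch computes the growth factor for each metric from the 2008 and 2015 figures on this slide. The dictionary layout and the function name `growth_factors` are illustrative choices, not part of the original deck.

```python
# Figures copied from the slide: metric -> (2008 value, 2015 value).
metrics = {
    "CUDA downloads": (150_000, 3_000_000),
    "Academic papers": (4_000, 60_000),
    "Universities teaching": (60, 800),
    "Supercomputing teraflops": (77, 54_000),
    "Tesla GPUs": (6_000, 450_000),
    "CUDA apps": (27, 334),
}


def growth_factors(table):
    """Return {metric: 2015 value divided by 2008 value}."""
    return {name: new / old for name, (old, new) in table.items()}


factors = growth_factors(metrics)
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
```

Every metric on the slide grows by more than 10x over the period, so the headline reads as a lower bound rather than an average.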
![Page 5: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/5.jpg)
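Tesla accelerates HPC data centers in government and enterprise
Sectors: supercomputing centers, higher education, government, energy, finance, manufacturing
Named users include: Tokyo Institute of Technology, Air Force Research Laboratory, Naval Research Laboratory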
![Page 6: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/6.jpg)
360+ GPU-accelerated applications
www.nvidia.com/appscatalog
![Page 7: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/7.jpg)
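Accelerated computing is being adopted rapidly
NVIDIA GPUs are the accelerator of choice: NVIDIA GPU 85%, others 15%
[Chart: GPU-accelerated applications per year, 2011 to 2015; data labels 113, 206, 242, 287, 367]
"More than half of new HPC systems will be installed with accelerators." - Intersect360 Research, Feb 2015
Source: Intersect360 Research, Top 6 Prediction in HPC, Feb 2015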
![Page 8: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/8.jpg)
High-density GPU servers have become mainstream
- Cray CS-Storm: 8 K80s per node
- Dell C4130: 4 K80s per node
- HP SL270: 8 K40s per node
- Sugon: 4 K80s per node
![Page 9: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/9.jpg)
NVIDIA deep learning training platform
![Page 10: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/10.jpg)
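Examples of deep learning
- Image classification, object detection, localization, action recognition
- Speech recognition, speech translation, natural language processing
- Breast cancer cell mitosis detection, volumetric brain image segmentation
- Pedestrian detection, lane detection, traffic sign recognition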
![Page 11: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/11.jpg)
What is deep learning?
Image: "Volvo XC90"
Image source: "Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks", ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.
![Page 12: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/12.jpg)
Why is deep learning so hot right now?
Three drivers: big data, GPU acceleration, new machine learning techniques
- 350 million images uploaded per day
- 2.5 petabytes of customer data hourly
- 300 hours of video uploaded every minute
![Page 13: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/13.jpg)
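GPUs and deep learning
GPU implementations deliver:
- the same or better prediction accuracy
- faster results
- a smaller footprint
- lower power
ImageNet Challenge accuracy: 72% (2010), 74% (2011), 84% (2012), 88% (2013), 93% (2014)
[Diagram: NEURAL NETWORKS and GPUS (NVIDIA CUDA GPU) compared on the following properties]
- Inherently parallel
- Matrix operations
- Floating-point operations
- Bandwidth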
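To illustrate why neural networks map onto the properties listed above, here is a minimal sketch of a fully connected layer written as a matrix-vector product. It is plain Python, not GPU code; the function name `dense_forward` and the sample numbers are made up for illustration. Each output element depends only on one row of the weight matrix, so all outputs could be computed independently, which is the kind of parallelism a GPU exploits.

```python
def dense_forward(weights, inputs, bias):
    """Compute y[i] = sum_j weights[i][j] * inputs[j] + bias[i].

    Every y[i] is independent of the others, so the rows can be
    evaluated in parallel (here they are simply evaluated in a loop).
    """
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical 3x2 weight matrix, 2-element input, 3-element bias.
weights = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
inputs = [2.0, 1.0]
bias = [0.0, 1.0, -1.0]

outputs = dense_forward(weights, inputs, bias)
print(outputs)  # [4.0, 1.0, 2.0]
```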
![Page 14: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/14.jpg)
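NVIDIA's complete deep learning platform
Software stack, top to bottom:
- Applications
- DIGITS tool
- Deep learning frameworks (Caffe, Torch, etc.)
- Libraries: cuDNN, cuBLAS, ...
- GPU: Tesla
Also shown: software, system management, servers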
![Page 15: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/15.jpg)
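NVIDIA cuDNN
GPU-accelerated deep learning frameworks
- High-performance neural network training
- GPU-accelerates Caffe, Theano, Torch, and other deep learning frameworks
- Supports widely used layer types, including pooling, ReLU, sigmoid, softmax, and tanh
- Optimized for the latest NVIDIA GPU architectures
- Supports Linux, Windows, OS X, and Linux for Tegra (ARM)
http://developer.nvidia.com/cuDNN
[Chart: performance keeps improving; millions of images trained per day for cuDNN 1, cuDNN 2, and cuDNN 3; y-axis 0 to 80]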
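For readers unfamiliar with the layer types named above, the sketch below gives plain-Python reference definitions of what each one computes on a 1-D list of numbers. These are illustrative functions only; they are not the cuDNN API, and the names and the non-overlapping 1-D pooling window are assumptions made for brevity.

```python
import math


def relu(xs):
    """Rectified linear unit: negative values become 0."""
    return [max(0.0, x) for x in xs]


def sigmoid(xs):
    """Logistic function, mapping each value into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]


def tanh(xs):
    """Hyperbolic tangent, mapping each value into (-1, 1)."""
    return [math.tanh(x) for x in xs]


def softmax(xs):
    """Normalize scores into probabilities that sum to 1.

    Subtracting the maximum first avoids overflow in exp().
    """
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool_1d(xs, window):
    """Max pooling over non-overlapping windows of the given size."""
    return [max(xs[i:i + window]) for i in range(0, len(xs) - window + 1, window)]


print(relu([-1.0, 2.0]))
print(max_pool_1d([1, 3, 2, 5, 4, 0], 2))
```

A GPU library applies the same element-wise or windowed operations to large multi-dimensional tensors, which is where the throughput gains come from.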
![Page 16: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/16.jpg)
NVIDIA DIGITS: an interactive GPU deep learning training system
Workflow: Process Data, Configure DNN, Monitor Progress, Visualize Layers
[Screenshot: Test Image]
http://developer.nvidia.com/digits
![Page 17: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/17.jpg)
GPU-accelerated deep learning frameworks

| | CAFFE | TORCH | THEANO | MINERVA | KALDI |
| --- | --- | --- | --- | --- | --- |
| Type | Deep Learning Framework | Scientific Computing Framework | Math Expression Compiler | Deep Learning Framework | Speech Recognition Toolkit |
| cuDNN | 3 | 3 | 3 | 3 | -- |
| License | BSD-2 | BSD | BSD | Apache 2.0 | Apache 2.0 |
| Interface(s) | Text-based definition files, Python, MATLAB | Python, Lua, MATLAB | Python | C++ | C++, Shell scripts |

Multi-GPU: In Progress; (nnet2)
Multi-Node: (nnet2)
Embedded
![Page 18: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/18.jpg)
NVIDIA online services platform
![Page 19: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/19.jpg)
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
Changing workloads are a good fit for GPUs
- Video transcoding, 2X: H.264 & H.265, SD & HD
- Machine learning online services, 2X
- Image processing, 5X: Resize, Filter, Search, Auto-Enhance
- Video processing, 4X: Real-time Super Resolution, Stabilization, Enhancements
![Page 20: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/20.jpg)
GPUs boost data center processing capacity
[Diagram: a data center running traditional and new workloads, before and after adding GPUs]
Add GPUs to boost the data center
Legend: available capacity, GPU capacity for growth, reclaimed CPU capacity
![Page 21: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/21.jpg)
The Tesla platform accelerates applications at scale
- Low Power, Small Form Factor: GPU acceleration in scale-out infrastructure
- GPU REST Engine: high-throughput, low-latency accelerated services
- Monitoring and Management: deploy fault-tolerant and elastic GPU systems
- Media Framework: painless out-of-the-box support of GPUs in FFMPEG and OBS
Logos: YARN, OBS, FFMPEG
![Page 22: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/22.jpg)
Low-power GPUs designed for applications at scale
- Maxwell Architecture: 135% performance per core and 2x performance/Watt of the Kepler architecture
- Low Power, Small Form Factor: PCIe low profile, 50W to 75W, easy upgrade/retrofit
- Versatile Compute Platform: CUDA, Video Enhancements, Analytics, General Acceleration
- Independent Video Engines: independent on-chip video encode and decode engines accelerate H.264 and H.265
![Page 23: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/23.jpg)
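Quality = assured
GPU processing makes real-time video enhancement possible
- Video stabilization
- Image contrast and sharpness enhancement
- Advanced denoising
- Deblocking
- Downscaling (Lanczos & poly-phase multi-tap filters)
- Super resolution
- Smooth frame-rate up-conversion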
![Page 24: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/24.jpg)
GPU transcoding: 10x the throughput at ½ the power
Video processing at scale
GPU: 10 streams, 75 Watt
CPU: 1 stream, 150 Watt
Pipeline stages: Decode, Enhance, Infer, Encode
Video engines: NVDEC (decode), NVENC (encode)
Stream labels: 1080p30 10 Mbps, 1080p30 5 Mbps, 720p30 3 Mbps, 720p 1.5 Mbps, 480p 1.2 Mbps, 360p 1 Mbps, 360p 0.5 Mbps, 240p 0.3 Mbps, 240p 0.15 Mbps
Additional stream label: 1080p30 5 Mbps
Note: 1080p30 H.264 source; enhancement includes deblocking, motion stabilization, and scaling
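The headline ratios follow directly from the stream counts and power figures on this slide. The short sketch below works through the arithmetic; the function name `streams_per_watt` is an illustrative choice, and the derived efficiency ratio is a consequence of the slide's numbers rather than a figure quoted on it.

```python
def streams_per_watt(streams, watts):
    """Transcoding throughput per unit of power."""
    return streams / watts


# Figures from the slide.
GPU_STREAMS, GPU_WATTS = 10, 75
CPU_STREAMS, CPU_WATTS = 1, 150

throughput_ratio = GPU_STREAMS / CPU_STREAMS  # 10x the throughput
power_ratio = GPU_WATTS / CPU_WATTS           # 0.5, i.e. half the power
efficiency_ratio = (
    streams_per_watt(GPU_STREAMS, GPU_WATTS)
    / streams_per_watt(CPU_STREAMS, CPU_WATTS)
)

print(f"throughput: {throughput_ratio:.0f}x")
print(f"power: {power_ratio:.1f}x")
print(f"streams per watt: {efficiency_ratio:.0f}x")
```

Combining 10x the throughput with half the power gives roughly 20x more streams per watt.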
![Page 25: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/25.jpg)
Tesla GPU products and roadmap
![Page 26: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/26.jpg)
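2015 Tesla GPU accelerator products

| Product | Availability | GPUs and memory | Power | Form factor | Positioning |
| --- | --- | --- | --- | --- | --- |
| K40 | 1H 2015 | 12GB | 235W | PCIe Passive | |
| K80 | 1H 2015 | 2xGPU, 2x12GB | 300W | PCIe Passive | |
| TESLA M40 | 2H 2015 (NEW!) | 12GB | 250W | PCIe Passive | Fastest DL Solution |
| TESLA M60 | 2H 2015 (NEW!) | 2xGPU, 2x8GB | 300W / 225W | PCIe Passive | VDI Solution |
| TESLA M6 | 2H 2015 (NEW!) | 8GB | 75W / 100W | MXM (PCIe in definition) | VDI Blade Solution |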
![Page 27: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/27.jpg)
Tesla M40: purpose-built for deep learning
The world's fastest deep learning training, designed for the data center
MAXWELL architecture: 3072 cores, ~7 TFLOPS, 12GB
Up to 2.4x K40 and up to 1.7x K80
Save days on each training
• 24/7 Reliability
• Scalable Perf. w/ RDMA
• Datacenter mgmt. tools
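The core count and peak throughput on this slide can be related with simple arithmetic. The sketch below back-solves the clock rate implied by 3072 cores and ~7 TFLOPS. It assumes each core retires one fused multiply-add per cycle, counted as 2 floating-point operations; that factor is an assumption added here, not a number from the slide, and the resulting clock is an implied value rather than a published specification.

```python
CORES = 3072        # from the slide
PEAK_TFLOPS = 7.0   # from the slide ("~7 TFLOPS")

# Assumption (not from the slide): one fused multiply-add per core
# per cycle, counted as 2 floating-point operations.
FLOPS_PER_CORE_PER_CYCLE = 2


def implied_clock_ghz(peak_tflops, cores, flops_per_cycle):
    """Clock rate at which cores * flops_per_cycle reaches peak_tflops."""
    flops_per_second = peak_tflops * 1e12
    return flops_per_second / (cores * flops_per_cycle) / 1e9


clock = implied_clock_ghz(PEAK_TFLOPS, CORES, FLOPS_PER_CORE_PER_CYCLE)
print(f"implied clock: {clock:.3f} GHz")
```

Under that assumption the implied clock is a little above 1.1 GHz.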
![Page 28: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/28.jpg)
Fastest deep learning training
Save days on each training iteration
Enable users to iterate to a final solution much faster
[Chart: number of days to train with Caffe and Torch on K40, K80, and M40; y-axis 0 to 10; callouts "Save 3 days" and "Save 2 days"]
Caffe uses AlexNet, batch size = 128, 280M-image training set
Torch uses OverFeat, batch size = 128, 140M-image training set
![Page 29: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/29.jpg)
GPUDirect RDMA: RDMA accelerates the scaling of deep learning
- Data is transferred directly between GPUs
- 67% lower GPU-to-GPU latency
- 5x higher GPU-to-GPU MPI bandwidth
Yahoo and Baidu use RDMA to speed up deep learning training
"We have enhanced Caffe to use multiple GPUs on a server and benefit from RDMA to synchronize Deep Learning models"
http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop
"Given the properties of DL's SGD algorithms, it is desired to have very high bandwidth and ultra low latency interconnects to minimize inter-node communication costs"
http://arxiv.org/vc/arxiv/papers/1501/1501.02876v1.pdf
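The quotes above refer to synchronizing models across GPUs during SGD. One common scheme is data-parallel training with gradient averaging, sketched below in plain Python. This is a generic illustration under that assumption, not the method used by Yahoo or Baidu; the function names and sample numbers are hypothetical. The averaging step is the part that crosses the interconnect on every iteration, which is why its bandwidth and latency matter.

```python
def average_gradients(per_gpu_grads):
    """Element-wise mean of the gradient vectors from all workers.

    On a real cluster this is the communication step (an all-reduce)
    that runs over the GPU interconnect on every iteration.
    """
    n = len(per_gpu_grads)
    return [sum(values) / n for values in zip(*per_gpu_grads)]


def sgd_step(weights, grad, lr):
    """Apply one SGD update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Hypothetical example: 4 GPUs, each with a gradient for 2 weights.
weights = [1.0, -2.0]
per_gpu_grads = [[0.5, 1.0], [1.5, -1.0], [1.0, 0.0], [1.0, 4.0]]

avg = average_gradients(per_gpu_grads)
new_weights = sgd_step(weights, avg, lr=0.5)
print(avg)          # [1.0, 1.0]
print(new_weights)  # [0.5, -2.5]
```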
![Page 30: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/30.jpg)
GPU roadmap
[Chart: SGEMM / W by GPU architecture; x-axis 2008 to 2018, y-axis 0 to 72]
Architectures: Tesla, Fermi, Kepler, Maxwell, Pascal, Volta
Pascal: Mixed Precision, Double Precision, 3D Memory, NVLink
![Page 31: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/31.jpg)
New features of the PASCAL GPU
NVLINK
- GPU high-speed interconnect
- Connects CPU to GPU or GPU to GPU
- NVLINK 1.0, 80 GB/s, 4 link pairs
3D Stacked Memory
- 4x higher bandwidth (~1 TB/s)
- 3x larger capacity
- 4x more energy efficient per bit
![Page 32: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/32.jpg)
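NVLink: high-speed GPU interconnect
Whitepaper: http://www.nvidia.com/object/nvlink.html
[Diagram 1: GPU to CPU via NVLink. A Pascal GPU connects to an NVLink-enabled CPU over 4 NVLinks at 20 GB/s each. GPU memory: HBM, 16-32 GB, 1 Tbyte/s. CPU memory: DDR4, 10s-100s of GB, 50-75 GB/s. Other labels: PCIe, Control.]
[Diagram 2: GPU to GPU via NVLink. Two Pascal GPUs connect over 4 NVLinks at 20 GB/s each and attach to an x86 CPU through a PCIe switch.]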
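The per-link and aggregate NVLink figures in this deck are consistent with each other, as the following sketch shows. The NVLink numbers come from the slides; the PCIe Gen3 x16 figure does not. It is computed here from the commonly cited 8 GT/s per lane with 128b/130b encoding, and is included only as an assumed baseline for comparison.

```python
# From the slides: 4 NVLinks at 20 GB/s each.
NVLINK_LINKS = 4
NVLINK_GB_PER_S_PER_LINK = 20.0


def pcie_gen3_gb_per_s(lanes):
    """Per-direction PCIe Gen3 bandwidth in GB/s.

    Assumption (not from the slides): 8 GT/s per lane with
    128b/130b encoding, 8 bits per byte.
    """
    return 8.0 * lanes * (128 / 130) / 8


nvlink_total = NVLINK_LINKS * NVLINK_GB_PER_S_PER_LINK
pcie_x16 = pcie_gen3_gb_per_s(16)
ratio = nvlink_total / pcie_x16

print(f"NVLink aggregate: {nvlink_total:.0f} GB/s")
print(f"PCIe Gen3 x16:    {pcie_x16:.2f} GB/s")
print(f"ratio:            {ratio:.1f}x")
```

Four links at 20 GB/s give the 80 GB/s aggregate quoted for NVLINK 1.0, and under the stated PCIe assumption the ratio comes out at roughly 5x.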
![Page 33: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/33.jpg)
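NVLink: High-Speed GPU Interconnect
[Diagram: 2014 and 2016 system configurations]
2014: KEPLER GPU connected over PCIe to an X86, ARM64, or POWER CPU
2016: PASCAL GPU connected over NVLink to a POWER CPU, or over PCIe to an X86, ARM64, or POWER CPU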
![Page 34: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/34.jpg)
NVLink unleashes multi-GPU performance
Over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe
[Diagram: TESLA GPUs interconnected with NVLink, 5x faster than PCIe Gen3 x16, attached to the CPU through a PCIe switch]
[Chart: speedup vs. PCIe-based server for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, and 3D FFT; y-axis 1.00x to 2.25x]
Note: 3D FFT and ANSYS use a 2-GPU configuration; all other apps compare a 4-GPU configuration. AMBER Cellulose (256x128x128), FFT problem size (256^3).
![Page 35: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/35.jpg)
A flexible architecture for accelerating the data center
- Pascal: best-in-class single-GPU performance
- NVLINK + UVM: efficient 4-GPU and 8-GPU scaling
- vGPU: graphics virtualization
[Diagram: 8 GPU Cube Mesh, with CPUs attached through PCIe switches]
![Page 36: NVIDIA GPU 深度学习平台 - HPC Advisory Council · 2020. 1. 14. · Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”](https://reader036.vdocument.in/reader036/viewer/2022081613/5fbb7d3e42bbff57ce6cea59/html5/thumbnails/36.jpg)
Thank you!