NVIDIA GPU Deep Learning Platform - HPC Advisory Council · 2020. 1. 14.

罗华平 (Luo Huaping), [email protected], 9 Nov 2015

NVIDIA GPU Deep Learning Platform

Contents

Recent developments in GPU computing

NVIDIA deep learning training platform

NVIDIA online services platform

Tesla GPU products and roadmap

CPU: optimized for serial tasks

GPU accelerator: optimized for parallel tasks

Accelerated computing: 10x performance & 5x energy efficiency

10x growth in accelerated computing, 2008 to 2015

CUDA downloads: 150,000 (2008), 3 million (2015)

Academic papers: 4,000 (2008), 60,000 (2015)

Universities teaching: 60 (2008), 800 (2015)

Supercomputing teraflops: 77 (2008), 54,000 (2015)

Tesla GPUs: 6,000 (2008), 450,000 (2015)

CUDA apps: 27 (2008), 334 (2015)
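The "10x" headline can be checked against the individual figures. The sketch below is only an illustration: it divides each 2015 value by its 2008 value, using the numbers from the slide. The names `METRICS` and `growth_multiple` are made up for this example.

```python
# Figures copied from the slide: (value in 2008, value in 2015).
METRICS = {
    "CUDA downloads": (150_000, 3_000_000),
    "Academic papers": (4_000, 60_000),
    "Universities teaching": (60, 800),
    "Supercomputing teraflops": (77, 54_000),
    "Tesla GPUs": (6_000, 450_000),
    "CUDA apps": (27, 334),
}


def growth_multiple(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


for name, (v2008, v2015) in METRICS.items():
    print(f"{name}: {growth_multiple(v2008, v2015):.1f}x")
```

Every metric on the slide grows by at least a factor of ten over the period, so "10x" reads as a lower bound rather than an average.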

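Tesla accelerates HPC data centers in government and enterprise

Sectors: supercomputing centers, higher education, government, energy, finance, manufacturing

Tokyo Institute of Technology

Air Force Research Laboratory

Naval Research Laboratory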

360+ GPU-accelerated applications: www.nvidia.com/appscatalog

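Accelerated computing is being adopted rapidly

NVIDIA GPUs are the accelerator of choice: NVIDIA GPU 85%, others 15%

GPU-accelerated applications, 2011 to 2015 (bar chart; bar labels 113, 206, 242, 287, 367)

“More than half of new HPC systems will have accelerators installed” - Intersect360 Research, Feb 2015

Source: Intersect360 Research. Top 6 Prediction in HPC, Feb 2015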

High-density GPU servers have become mainstream

Cray CS-Storm: 8 K80s per node

Dell C4130: 4 K80s per node

HP SL270: 8 K40s per node

Sugon: 4 K80s per node

NVIDIA deep learning training platform

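Examples of deep learning

Image classification, object detection, localization, action recognition

Speech recognition, speech translation, natural language processing

Breast cancer cell mitosis detection, volumetric brain image segmentation

Pedestrian detection, lane detection, traffic sign recognition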

What is deep learning?

Image “Volvo XC90”

Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.

Why is deep learning so hot right now?

Big data, GPU acceleration, new machine learning techniques

350 million images uploaded per day

2.5 petabytes of customer data hourly

300 hours of video uploaded every minute

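GPUs and deep learning

GPU implementations deliver:
- the same or better prediction accuracy
- faster results
- a smaller footprint
- lower power

ImageNet Challenge accuracy (chart): 72% (2010), 74% (2011), 84% (2012), 88% (2013), 93% (2014)

NVIDIA CUDA GPU

What neural networks and GPUs have in common: inherent parallelism, matrix operations, floating-point operations, bandwidth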
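To make the "inherently parallel matrix operations" point concrete, here is a minimal sketch of one fully connected layer written as a matrix-vector product. It runs on the CPU in plain Python and is not GPU code. The function name `dense_forward` and the weights are hypothetical, chosen only for illustration.

```python
def dense_forward(weights, bias, x):
    """Compute y = W x + b for one input vector.

    Each output element depends only on one row of W, so all rows could be
    evaluated at the same time on parallel hardware.
    """
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical 2x3 layer; the values are chosen only for illustration.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]
y = dense_forward(W, b, [2.0, 4.0, 6.0])
print(y)
```

Because no output row reads another row's result, the work splits cleanly across many cores. The multiply-add inner loop is the floating-point workload the slide refers to.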

NVIDIA's complete deep learning platform

Software

Applications

DIGITS tools

Deep learning frameworks (Caffe, Torch, etc.)

Libraries: cuDNN, cuBLAS …

System management

GPU: Tesla

Servers

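NVIDIA cuDNN

High-performance neural network training

GPU acceleration for Caffe, Theano, Torch, and other deep learning frameworks

Supports widely used layer types, including pooling, ReLU, sigmoid, softmax, and tanh

Optimized for the latest NVIDIA GPU architectures

Supports Linux, Windows, OS X, and Linux for Tegra (ARM)

GPU-accelerated deep learning frameworks

Performance keeps improving (chart): millions of images trained per day, axis 0 to 80, for cuDNN 1, cuDNN 2 and cuDNN 3

http://developer.nvidia.com/cuDNN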
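For readers new to the layer types named for cuDNN, the sketch below gives reference definitions in plain Python. It is not the cuDNN API and makes no claim about how cuDNN implements these layers. The function names are chosen for this example, and the pooling variant (1-D, non-overlapping) is an assumption made to keep it short.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: pass positives through, clamp negatives to 0."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    """Hyperbolic tangent, mapping any real input into (-1, 1)."""
    return math.tanh(x)


def softmax(xs):
    """Softmax with max-subtraction for numerical stability."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool_1d(xs, window: int):
    """Non-overlapping 1-D max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + window]) for i in range(0, len(xs) - window + 1, window)]
```

All five are element-wise or small-window operations, which is why they map well onto GPU threads.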

NVIDIA DIGITS: an interactive GPU deep learning training system

Process Data, Configure DNN, Monitor Progress, Visualize Layers

Test Image

http://developer.nvidia.com/digits

GPU-accelerated deep learning frameworks

Caffe: deep learning framework; cuDNN 3; license BSD-2; interfaces: text-based definition files, Python, MATLAB

Torch: scientific computing framework; cuDNN 3; license BSD; interfaces: Python, Lua, MATLAB

Theano: math expression compiler; cuDNN 3; license BSD; interface: Python

Minerva: deep learning framework; cuDNN 3; license Apache 2.0; interface: C++

Kaldi: speech recognition toolkit; cuDNN --; license Apache 2.0; interfaces: C++, shell scripts

Multi-GPU: In Progress, (nnet2)

Multi-Node: (nnet2)

Embedded

NVIDIA online services platform

Evolving workloads are a good fit for GPUs

Video transcoding, 2X: H.264 & H.265, SD & HD

Machine learning online services, 2X

Image processing, 5X: resize, filter, search, auto-enhance

Video processing, 4X: real-time super resolution, stabilization, enhancements


GPUs boost data center processing capacity

Diagram (before and after): "+ Add GPUs to boost data center". Labels: Traditional, New Workload, Available capacity, GPU capacity for growth, Reclaimed CPU capacity


The Tesla platform accelerates applications at scale

Low Power, Small Form Factor: GPU acceleration in scale-out infrastructure

GPU REST Engine: high-throughput, low-latency accelerated services

Monitoring and Management: deploy fault-tolerant and elastic GPU systems

Media Framework: painless out-of-the-box support of GPUs in FFMPEG and OBS

Logos: YARN, OBS, FFMPEG


Low-power GPUs designed for applications at scale

Maxwell Architecture: 135% performance per core, 2x performance/Watt of Kepler Architecture

Low Power, Small Form Factor: PCIe low profile, 50W to 75W; easy upgrade/retrofit

Versatile Compute Platform: CUDA, video enhancements, analytics, general acceleration

Independent Video Engines: independent on-chip video encode and decode engines accelerate H.264 and H.265


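Guaranteed quality: GPU processing makes real-time video enhancement possible

Video stabilization

Image contrast and sharpness improvement

Advanced denoising

Deblocking

Downscaling (Lanczos & poly-phase multi-tap filters)

Super resolution

Smooth frame rate up-conversion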


GPU transcoding: 10x the throughput at ½ the power

Video processing at scale

GPU: 10 streams, 75 W

CPU: 1 stream, 150 W

Pipeline: Decode, Enhance, Infer, Encode (NVDEC and NVENC engines)

Stream labels shown: 1080p30 10 Mbps, 1080p30 5 Mbps, 720p30 3 Mbps, 720p 1.5 Mbps, 480p 1.2 Mbps, 360p 1 Mbps, 360p 0.5 Mbps, 240p 0.3 Mbps, 240p 0.15 Mbps, 1080p30 5 Mbps

1080p30 H.264 source; enhancement includes deblocking, motion stabilization and scaling
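The two headline ratios combine into a single efficiency figure. As a rough sketch using only the stream counts and wattages from the slide, the arithmetic is below. The function name `streams_per_watt` is made up for this example, and the calculation assumes the wattages are the full power draw of each device while transcoding.

```python
def streams_per_watt(streams: int, watts: float) -> float:
    """Transcoded streams delivered per watt of device power."""
    return streams / watts


gpu = streams_per_watt(10, 75.0)   # 10 streams at 75 W (from the slide)
cpu = streams_per_watt(1, 150.0)   # 1 stream at 150 W (from the slide)

throughput_ratio = 10 / 1
power_ratio = 75.0 / 150.0
efficiency_ratio = gpu / cpu

print(f"throughput {throughput_ratio:.0f}x, power {power_ratio:.1f}x, "
      f"streams per watt {efficiency_ratio:.0f}x")
```

Ten times the streams at half the power works out to roughly twenty times the streams per watt.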

Tesla GPU products and roadmap

2015 Tesla GPU accelerator products

1H 2015

K40: 12GB, 235W, PCIe passive

K80: 2x GPU, 2x 12GB, 300W, PCIe passive

2H 2015 (new)

Tesla M60: 2x GPU, 2x 8GB, 300W / 225W, PCIe passive; VDI solution

Tesla M40: 12GB, 250W, PCIe passive; fastest DL solution

Tesla M6: 8GB, 75W / 100W, MXM (PCIe in definition); VDI blade solution

Tesla M40: purpose-built for deep learning

Maxwell architecture: 3072 cores, ~7 TFLOPS, 12GB

Designed for the data center
• 24/7 reliability
• Scalable performance with RDMA
• Data center management tools

World's fastest deep learning training
• Up to 2.4x K40
• Up to 1.7x K80
• Save days on each training
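The "~7 TFLOPS" figure is consistent with the usual peak-throughput estimate of cores × 2 FLOPs per cycle × clock, where one fused multiply-add counts as two operations. The sketch below is a back-of-the-envelope check, not a statement of how NVIDIA derived the number. The slide gives no clock speed, so the 1.114 GHz boost clock is an assumption, and `peak_tflops` is a name made up for this example.

```python
def peak_tflops(cuda_cores: int, clock_ghz: float,
                flops_per_core_per_cycle: int = 2) -> float:
    """Peak single-precision TFLOPS, counting one fused multiply-add as 2 FLOPs."""
    return cuda_cores * flops_per_core_per_cycle * clock_ghz / 1000.0


M40_CORES = 3072            # from the slide
ASSUMED_BOOST_GHZ = 1.114   # assumption: the slide does not state a clock

m40 = peak_tflops(M40_CORES, ASSUMED_BOOST_GHZ)
print(f"Estimated peak: {m40:.2f} TFLOPS")
```

This is a theoretical ceiling. Sustained training throughput depends on memory bandwidth and how well each layer keeps the cores busy.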


Fastest deep learning training

Save days on each training iteration

Enable users to iterate to a final solution much faster

Chart: number of days to train (axis 0 to 10) for Caffe and Torch on K40, K80 and M40, annotated “Save 3 days” and “Save 2 days”

Caffe uses AlexNet, batch size = 128, 280M-image training set

Torch uses OverFeat, batch size = 128, 140M-image training set


GPUDirect RDMA

Direct data transfer between GPUs

67% lower GPU-to-GPU latency

5x higher GPU-to-GPU MPI bandwidth

RDMA accelerates deep learning at scale

Yahoo and Baidu use RDMA to speed up deep learning training

“We have enhanced Caffe to use multiple GPUs on a server and benefit from RDMA to synchronize Deep Learning models” http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop

“Given the properties of DL’s SGD algorithms, it is desired to have very high bandwidth and ultra low latency interconnects to minimize inter-node communication costs” http://arxiv.org/vc/arxiv/papers/1501/1501.02876v1.pdf
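To see how the latency and bandwidth figures interact, a simple cost model helps: time per message = latency + size / bandwidth. The sketch below applies the slide's two factors (67% lower latency, 5x bandwidth) to a baseline. The baseline latency and bandwidth are hypothetical values picked only for illustration, not measurements, and the function names are made up for this example.

```python
def transfer_time_s(message_bytes: float, latency_s: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Simple latency + size/bandwidth cost model for one message."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical baseline, chosen only for illustration.
BASE_LATENCY_S = 20e-6
BASE_BANDWIDTH = 2e9        # bytes per second

# Factors quoted on the slide.
rdma_latency = BASE_LATENCY_S * (1 - 0.67)
rdma_bandwidth = BASE_BANDWIDTH * 5


def speedup(message_bytes: float) -> float:
    """Ratio of baseline transfer time to RDMA transfer time."""
    base = transfer_time_s(message_bytes, BASE_LATENCY_S, BASE_BANDWIDTH)
    rdma = transfer_time_s(message_bytes, rdma_latency, rdma_bandwidth)
    return base / rdma


small = speedup(1_000)          # latency-dominated message
large = speedup(250_000_000)    # bandwidth-dominated, e.g. a large gradient buffer
print(f"small message: {small:.2f}x, large message: {large:.2f}x")
```

Under this model the gain is about 3x for small messages and approaches 5x for large ones. Synchronizing large model parameters, as in the SGD quote, sits at the bandwidth-bound end.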

GPU roadmap

Chart: SGEMM / W (axis 0 to 72) by year, 2008 to 2018, for successive architectures: Tesla, Fermi, Kepler, Maxwell, Pascal (mixed precision, double precision, 3D memory, NVLink), Volta

New features of the Pascal GPU

NVLINK: GPU high-speed interconnect
- Connect CPU to GPU or GPU to GPU
- NVLINK 1.0, 80 GB/s, 4 link pairs

3D Stacked Memory
- 4x higher bandwidth (~1 TB/s)
- 3x larger capacity
- 4x more energy efficient per bit

NVLink: high-speed GPU interconnect

Whitepaper: http://www.nvidia.com/object/nvlink.html

GPU to CPU via NVLink (diagram): a Pascal GPU connects to an NVLink-enabled CPU over 4 NVLink links at 20 GB/s each. GPU memory is HBM, 16-32 GB at 1 TB/s; CPU memory is DDR4, 10s-100s of GB at 50-75 GB/s.

GPU to GPU via NVLink (diagram): two Pascal GPUs connect to each other over 4 NVLink links at 20 GB/s each, and to an x86 CPU through a PCIe switch (PCIe carries control).

NVLink: high-speed GPU interconnect

Diagram: in 2014, Kepler GPUs attach to x86, ARM64 and POWER CPUs over PCIe. In 2016, Pascal GPUs attach to x86, ARM64 and POWER CPUs over PCIe, with NVLink between GPUs and to POWER CPUs.

NVLink unleashes multi-GPU performance

Diagram: Tesla GPUs interconnected with NVLink, 5x faster than PCIe Gen3 x16, attached to the CPU through a PCIe switch

Over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe

Chart: speedup vs. PCIe-based server (axis 1.00x to 2.25x) for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, 3D FFT

3D FFT, ANSYS: 2 GPU configuration; all other apps compare 4 GPU configurations. AMBER Cellulose (256x128x128), FFT problem size (256^3)
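The "5x faster than PCIe Gen3 x16" claim can be cross-checked against the link figures quoted on the Pascal slides (4 NVLink links at 20 GB/s each). The sketch below is a rough check that assumes both figures are counted per direction. The PCIe number is the standard theoretical rate (8 GT/s per lane, 128b/130b encoding), not a measured value, and the names are made up for this example.

```python
NVLINK_LINKS = 4
NVLINK_GB_PER_S_PER_LINK = 20.0   # from the slides


def pcie_gen3_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction PCIe Gen3 bandwidth in GB/s.

    8 GT/s per lane with 128b/130b encoding, 8 bits per byte.
    """
    return 8.0 * lanes * (128.0 / 130.0) / 8.0


nvlink_total = NVLINK_LINKS * NVLINK_GB_PER_S_PER_LINK
pcie_x16 = pcie_gen3_gb_per_s(16)
ratio = nvlink_total / pcie_x16

print(f"NVLink {nvlink_total:.0f} GB/s vs PCIe Gen3 x16 {pcie_x16:.2f} GB/s: "
      f"{ratio:.2f}x")
```

The raw link ratio comes out close to 5x. The application speedups in the chart are smaller (about 2x) because only part of each application's runtime is spent on GPU-to-GPU transfers.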


A flexible architecture for accelerating the data center

8 GPU Cube Mesh (diagram: GPUs attached to two CPUs, each through a PCIe switch)

NVLINK + UVM: efficient 4-GPU and 8-GPU scaling

Pascal: best-in-class single GPU performance

vGPU: graphics virtualization

Thank you!
