TRANSCRIPT: High Performance Deep Learning Clusters
Julie Bernauer, November 19th 2019
on-demand.gputechconf.com/supercomputing/...
NVIDIA ACCELERATED COMPUTING GROWTH

600+ CUDA APPS
CRYOSPARC (Cryo-EM) | GROMACS (Chemistry) | MICROVOLUTION (Microscopy) | WRF (Weather) | FUN3D (CFD) | PARABRICKS (Genomics)

50% GROWTH IN TOP500
#1 World, US — ORNL Summit | #1 Europe — CSCS Piz Daint | #1 Japan — AIST ABCI
22 of Top 25 Energy-Efficient
NVIDIA in World’s Most Energy-Efficient Supercomputers, 2010–2018
NVIDIA in World’s Most Powerful Supercomputers

50% GROWTH OF NVIDIA DEVELOPERS
Developers: 800K (2018) to 1.2M (2019), +50%
CUDA downloads: 8M (2018) to 13M (2019), +60%
In the age of machine learning, a powerful computing infrastructure is essential to
creating software.
DL Training: From Single GPU to Multi-node
ResNet-50 v1.5 training
2015: 36,000 mins (25 days) | 1x K80 | CUDA
2016: 1,200 mins (20 hours) | DGX-1P | NVLink
2017: 480 mins (8 hours) | DGX-1V | Tensor Core
2018: 70 mins on MLPerf | DGX-2H | NVSwitch
2018: 6.3 mins on MLPerf | At Scale | DGX Cluster
2019: 52.7 mins on MLPerf | DGX-2H | NVSwitch
2019: 1.33 mins on MLPerf | At Scale | DGX SuperPOD
Largest TensorFlow model at scale: Oak Ridge National Lab scales a TensorFlow climate analytics model up to 27,360 V100 GPUs
Source: https://arxiv.org/pdf/1810.01993.pdf
2018 Gordon Bell Prize Winner
MLPerf: NVIDIA ADVANCING AI TRAINING
Time to Train From 8 Hours to 80 Seconds
2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23
Project Megatron: Largest transformer-based language model
8.3B parameters | 8-way model parallel | 64-way data parallel | 24x larger than BERT
SOTA: WikiText-103 perplexity 10.8
LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) 66.5% accuracy
https://github.com/NVIDIA/Megatron-LM
https://arxiv.org/pdf/1909.08053v3.pdf
Models getting more complex
Hardware
From servers...
NVIDIA DGX-2
2 PFLOPS | 512GB HBM2 | 10kW | 350 lbs
16x Tesla V100 32GB | 12x NVSwitch | NVLink plane card
8x EDR IB/100 GigE | 2x Xeon Platinum | 1.5TB system memory
PCIe switch complex | 30TB NVMe SSDs
DGX-2 components:
• NVIDIA SXM3 Tesla V100, 32GB HBM2
• Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
• Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
• Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bidirectional bandwidth
• PCIe switch complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB system memory
• 30 TB NVMe SSDs internal storage
• Dual 10/25 Gb/sec Ethernet
Hardware
...to supercomputers
NVIDIA DGX SUPERPOD
Mellanox EDR 100G InfiniBand Network
Mellanox Smart Director Switches
In-Network Computing Acceleration Engines
Fast and Efficient Storage Access with RDMA
Up to 130Tb/s Switching Capacity per Switch
Ultra-Low Latency of 300ns
Integrated Network Manager
Terabit-Speed InfiniBand Networking per Node
Topology: Rack 1 … Rack 16, 64 DGX-2 total
Compute backplane switch: 800 Gb/s per node
Storage backplane switch: 200 Gb/s per node, GPFS
https://mlperf.org/
INDUSTRY-WIDE BENCHMARK SUITE FOR AI PERFORMANCE
• World’s largest transformer-based language model ever trained (8.3 billion parameters): 24x the size of BERT (345M parameters), 5.6x the size of GPT-2 (1.5B parameters)
• Achieved 15.1 PetaFLOPS sustained performance over the entire application using 512 GPUs, at 76% scaling efficiency
• 12 ZettaFLOPs to converge, in 9.2 days
• SOTA for LAMBADA accuracy (66.5% compared to 63.2%) and WikiText-103 perplexity (10.81 compared to 16.4), using 174 GB of training data
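As a cross-check of the figures above: 8-way model parallelism times 64-way data parallelism gives the 512 GPUs quoted, and dividing the total compute by the sustained throughput reproduces the 9.2-day training time.

```latex
8 \times 64 = 512\ \text{GPUs}, \qquad
\frac{12 \times 10^{21}\ \text{FLOPs}}{15.1 \times 10^{15}\ \text{FLOP/s}}
\approx 7.9 \times 10^{5}\ \text{s} \approx 9.2\ \text{days}
```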
*Top figure from Huggingface DistilBERT blog post (https://medium.com/huggingface/distilbert-8cf3380435b5)
Megatron
Empty racks to running in 3 weeks
5km of IB cables and 1.5k GPUs that can be deployed anywhere in less than 3 weeks.
Software
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications
High performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs
High performance GPU-acceleration for deep learning
developer.nvidia.com/deep-learning-software
Deep Learning Primitives
Multi-GPU Communication
Linear Algebra
Programmable Inference Accelerator
Sparse Matrix Operations
Deep Learning for Video Analytics
NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU multi-node acceleration that maximizes inter-GPU bandwidth utilization
Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more
Multi-Node:InfiniBand verbs, IP Sockets
Multi-GPU: NVLink, PCIe
Automatic Topology Detection
18.11 MXNet container; runs for performance demonstration only, not convergence runs
Blog: https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/
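A minimal sketch of the environment a multi-node NCCL run over InfiniBand typically uses; the interface and HCA names (ib0, mlx5) and the launch line are site-specific assumptions, not part of the talk. NCCL's automatic topology detection then picks NVLink within a node and IB verbs across nodes.

```shell
#!/bin/sh
# Sketch: NCCL environment for a multi-node InfiniBand run.
# Interface/HCA names (ib0, mlx5) are site-specific assumptions.

export NCCL_DEBUG=INFO          # log topology detection and ring setup
export NCCL_IB_HCA=mlx5         # use Mellanox HCAs for inter-node transport
export NCCL_SOCKET_IFNAME=ib0   # interface for the bootstrap connection

# Launch line (echoed, not executed here): one rank per GPU across nodes.
echo "mpirun -np 32 -npernode 16 python train.py"
```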
Deep learning frameworks offer building blocks for designing, training and validating deep neural networks, through a high level programming interface.
Apply AI to challenging problems in computer vision, natural language processing and others
Research novel deep neural networks for new application areas
Deliver high-performance training with GPU-accelerated NVIDIA Deep Learning SDK libraries
Computer Vision | Natural Language Processing | Speech and audio processing | Robot learning | more…
MATLAB
NVIDIA DEEP LEARNING SDK and CUDA
developer.nvidia.com/deep-learning-frameworks
DEEP LEARNING FRAMEWORKS
Essential deep learning tools for data scientists, researchers and engineers
NGC Containers
We built libnvidia-container to make it easy to run CUDA applications inside containers.
We release optimized container images for each of the major DL frameworks every month, and provide them for anyone to use.
We use containers for everything on our HPC clusters - R&D, official benchmarks, etc.
Containers give us portable software stacks without sacrificing performance.
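A minimal sketch of pulling up one of these monthly framework images; the tag follows the YY.MM scheme mentioned above, and the mount path is an assumption. The command is echoed rather than executed, since it needs Docker and registry access.

```shell
#!/bin/sh
# Sketch: running an NGC framework container (tag is an assumption).
IMAGE="nvcr.io/nvidia/pytorch:19.09-py3"

# Echoed rather than executed: requires docker and an NGC pull.
echo "docker run --gpus all --rm -it --ipc=host -v \$PWD:/workspace $IMAGE"
```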
From an infra perspective...
• Slurm: User job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks
Scale to multiple nodes: software stack (system)
Login nodes | DGX POD: DGX servers with DGX Base OS
Slurm controller | Enroot | Docker | Pyxis
NGC model containers (PyTorch, TensorFlow from 19.09)
DCGM
Example: SLURM+Docker+MPI
Excerpts from an actual script used to launch jobs for the MLPerf v0.5 benchmark (208 LOC total):
1. Set up docker flags
2. Set up mpirun flags
3. Set up SSH
4. Start sleep containers
5. Launch mpirun in the rank-0 container
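The five steps above can be sketched as follows; the flag values and names (image:tag, run_benchmark.sh) are illustrative assumptions, not the actual MLPerf script. The cluster commands are echoed rather than executed.

```shell
#!/bin/sh
# Sketch of the SLURM+Docker+MPI launch steps (values are assumptions).

# 1. Docker flags: share host namespaces so MPI ranks can reach each other.
DOCKER_FLAGS="--rm --net=host --uts=host --ipc=host --pid=host"

# 2. mpirun flags: one rank per GPU on every allocated node (16 per DGX-2).
NP=$(( ${SLURM_JOB_NUM_NODES:-4} * 16 ))
MPIRUN_FLAGS="-np $NP"

# 3. SSH setup between containers (key generation/distribution) goes here.
# 4. Start an idle "sleep" container on every node for mpirun to attach to:
echo "srun docker run $DOCKER_FLAGS image:tag sleep infinity"
# 5. From the rank-0 container, launch the job:
echo "mpirun $MPIRUN_FLAGS ./run_benchmark.sh"
```

This is the complexity that Enroot and Pyxis, described next, are designed to remove.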
Containers at NVIDIA
What we need
● High performance
● Unprivileged runtime
● Uses docker image format
What we want
● Preserve SLURM cgroups
● NVIDIA+Mellanox devices are available by default
● MPI between containers is easy
● Can install packages inside containers
ENROOT: Improved Linux utils
enroot-unshare : like unshare(1), creates new namespaces
enroot-mount : like mount(8), mounts filesystems
enroot-switchroot : like switch_root(8), changes rootfs
enroot-aufs2ovlfs : converts AUFS whiteouts to OverlayFS
enroot-mksquashovlfs : like mksquashfs(1) on top of OverlayFS
http://github.com/nvidia/enroot
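On top of these utilities, Enroot exposes a small import/create/start workflow. A minimal sketch, with the image and sandbox names as assumptions; the commands are echoed rather than executed since they need Enroot and registry access. Everything runs unprivileged, with no daemon.

```shell
#!/bin/sh
# Sketch: typical Enroot workflow (image/sandbox names are assumptions).
IMAGE="nvcr.io/nvidia/pytorch:19.09-py3"
IMPORT_CMD="enroot import docker://$IMAGE"

echo "$IMPORT_CMD"                                  # docker image -> squashfs
echo "enroot create --name pytorch ./pytorch.sqsh"  # unpack into a sandbox
echo "enroot start pytorch"                         # enter and run, unprivileged
```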
Examples: Pyxis, MPI workload
1. No need to pass through environment variables (Pyxis inherits them all)
2. No need for any of these docker args: --rm --net=host --uts=host --ipc=host --pid=host
3. No need to configure mpirun (SLURM handles it)
4. No need to set up SSH (PMIx doesn’t use it)
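With Pyxis, the whole MPI workload collapses to a single srun line; --container-image is the flag Pyxis adds to srun, while the node/task counts and image tag here are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: the same MPI workload launched through Slurm with Pyxis.
SRUN_CMD="srun --mpi=pmix -N 4 --ntasks-per-node=16 \
--container-image=nvcr.io/nvidia/pytorch:19.09-py3 python train.py"

echo "$SRUN_CMD"   # no docker flags, no mpirun config, no SSH setup
```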
Integrating clusters in the development workflow
• Integrating DL-friendly tools like GitLab, Docker w/ HPC systems
Kick off 10,000s of GPU hours of tests with a single button click in GitLab
… build and package with Docker … schedule and prioritize with SLURM … on demand or on a schedule … reporting via GitLab, ELK stack, Slack, email
Emphasis on keeping things simple for users while hiding integration complexity
Ensure reproducibility and rapid triage
Supercomputer-scale CI (internal continuous integration at NVIDIA)
Artifactory
GitLab
Docker
SLURM
ELK
In conclusion...
• Scaling requires careful consideration of algorithms and infrastructure at each step
• Optimized single-GPU model
• Efficient & scalable Allreduce library
• GPU interconnect, networking, storage
• The NVIDIA platform makes scaling easier and more efficient
• Deep Learning Examples with SOTA accuracy and performance
• NVIDIA NGC Container with optimized multi-GPU/multi-node software stack
• Accelerated compute platform designed for performance and scaling
A powerful computing infrastructure is essential to research
Scaling is important and we’re here to help