TRANSCRIPT: High Performance Deep Learning Clusters
Julie Bernauer, November 19th 2019
on-demand.gputechconf.com/supercomputing/...
NVIDIA ACCELERATED COMPUTING GROWTH

600+ CUDA APPS
CRYOSPARC (Cryo-EM) | GROMACS (Chemistry) | MICROVOLUTION (Microscopy) | WRF (Weather) | FUN3D (CFD) | PARABRICKS (Genomics)

50% GROWTH IN TOP500
#1 World, US — ORNL Summit | #1 Europe — CSCS Piz Daint | #1 Japan — AIST ABCI
22 of Top 25 Energy-Efficient
NVIDIA in World’s Most Energy-Efficient Supercomputers, 2010–2018
NVIDIA in World’s Most Powerful Supercomputers

50% GROWTH OF NVIDIA DEVELOPERS
Developers: 800K (2018) to 1.2M (2019), +50%
CUDA downloads: 8M (2018) to 13M (2019), +60%
In the age of machine learning, a powerful computing infrastructure is essential to
creating software.
DL Training: From Single GPU to Multi-node
ResNet-50 v1.5 training
2015: 36,000 mins (25 days) | 1x K80 | CUDA
2016: 1,200 mins (20 hours) | DGX-1P | NVLink
2017: 480 mins (8 hours) | DGX-1V | Tensor Core
2018: 70 mins on MLPerf | DGX-2H | NVSwitch
2018: 6.3 mins on MLPerf | At Scale | DGX Cluster
2019: 52.7 mins on MLPerf | DGX-2H | NVSwitch
2019: 1.33 mins on MLPerf | At Scale | DGX SuperPOD
Largest TensorFlow model at scale: Oak Ridge National Lab scales a TensorFlow climate analytics model up to 27,360 V100 GPUs
Source: https://arxiv.org/pdf/1810.01993.pdf
2018 Gordon Bell Prize Winner
MLPerf: NVIDIA ADVANCING AI TRAINING
Time to Train From 8 Hours to 80 Seconds
2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23
Project Megatron: Largest transformer-based language model
8.3B parameters | 8-way model parallel | 64-way data parallel | 24x larger than BERT
SOTA: WikiText-103 perplexity 10.8
LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) 66.5% accuracy
https://github.com/NVIDIA/Megatron-LM
https://arxiv.org/pdf/1909.08053v3.pdf
Models getting more complex
Hardware
From servers...
NVIDIA DGX-2
2 PFLOPS | 512GB HBM2 | 10kW | 350 lbs
16x Tesla V100 32GB | 12x NVSwitch | NVLink plane card
8x EDR IB/100 GigE | 2x Xeon Platinum | 1.5TB system memory
PCIe switch complex | 30TB NVMe SSDs
DGX-2 components:
• NVIDIA SXM3 Tesla V100, 32GB HBM2
• Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
• Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
• Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bidirectional bandwidth
• PCIe switch complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB system memory
• 30 TB NVMe SSDs internal storage
• Dual 10/25 Gb/sec Ethernet
Hardware
...to supercomputers
NVIDIA DGX SUPERPOD
Mellanox EDR 100G InfiniBand Network
Mellanox Smart Director Switches
In-Network Computing Acceleration Engines
Fast and Efficient Storage Access with RDMA
Up to 130Tb/s Switching Capacity per Switch
Ultra-Low Latency of 300ns
Integrated Network Manager
Terabit-Speed InfiniBand Networking per Node
Topology: Rack 1 … Rack 16, 64 DGX-2 total
Compute backplane switch: 800 Gb/s per node
Storage backplane switch: 200 Gb/s per node, GPFS
https://mlperf.org/
INDUSTRY-WIDE BENCHMARK SUITE FOR AI PERFORMANCE
• World’s largest transformer-based language model ever trained (8.3 billion parameters): 24x the size of BERT (345M parameters), 5.6x the size of GPT-2 (1.5B parameters)
• Achieved 15.1 PetaFLOPS sustained performance over the entire application using 512 GPUs, at 76% scaling efficiency
• 12 ZettaFLOPs to converge, in 9.2 days
• SOTA for LAMBADA accuracy (66.5% compared to 63.2%) and WikiText-103 perplexity (10.81 compared to 16.4), using 174 GB of training data
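As a cross-check of the figures above: 8-way model parallelism times 64-way data parallelism gives the 512 GPUs quoted, and dividing the total compute by the sustained throughput reproduces the 9.2-day training time.

```latex
8 \times 64 = 512\ \text{GPUs}, \qquad
\frac{12 \times 10^{21}\ \text{FLOPs}}{15.1 \times 10^{15}\ \text{FLOP/s}}
\approx 7.9 \times 10^{5}\ \text{s} \approx 9.2\ \text{days}
```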
*Top figure from Huggingface DistilBERT blog post (https://medium.com/huggingface/distilbert-8cf3380435b5)
Megatron
Empty racks to running in 3 weeks
5km of IB cables and 1.5k GPUs that can be deployed anywhere in less than 3 weeks.
Software
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications
High performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs
High performance GPU-acceleration for deep learning
developer.nvidia.com/deep-learning-software
Deep Learning Primitives
Multi-GPU Communication
Linear Algebra
Programmable Inference Accelerator
Sparse Matrix Operations
Deep Learning for Video Analytics
NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU multi-node acceleration that maximizes inter-GPU bandwidth utilization
Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more
Multi-Node:InfiniBand verbs, IP Sockets
Multi-GPU: NVLink, PCIe
Automatic Topology Detection
18.11 MXNet container; runs for performance demonstration only, not convergence runs
Blog: https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/
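A minimal sketch of the environment a multi-node NCCL run over InfiniBand typically uses; the interface and HCA names (ib0, mlx5) and the launch line are site-specific assumptions, not part of the talk. NCCL's automatic topology detection then picks NVLink within a node and IB verbs across nodes.

```shell
#!/bin/sh
# Sketch: NCCL environment for a multi-node InfiniBand run.
# Interface/HCA names (ib0, mlx5) are site-specific assumptions.

export NCCL_DEBUG=INFO          # log topology detection and ring setup
export NCCL_IB_HCA=mlx5         # use Mellanox HCAs for inter-node transport
export NCCL_SOCKET_IFNAME=ib0   # interface for the bootstrap connection

# Launch line (echoed, not executed here): one rank per GPU across nodes.
echo "mpirun -np 32 -npernode 16 python train.py"
```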
Deep learning frameworks offer building blocks for designing, training and validating deep neural networks, through a high level programming interface.
Apply AI to challenging problems in computer vision, natural language processing and others
Research novel deep neural networks for new application areas
Deliver high-performance training with GPU-accelerated NVIDIA Deep Learning SDK libraries
Computer Vision | Natural Language Processing | Speech and audio processing | Robot learning | more…
MATLAB
NVIDIA DEEP LEARNING SDK and CUDA
developer.nvidia.com/deep-learning-frameworks
DEEP LEARNING FRAMEWORKS
Essential deep learning tools for data scientists, researchers and engineers
NGC Containers
We built libnvidia-container to make it easy to run CUDA applications inside containers.
We release optimized container images for each of the major DL frameworks every month, and provide them for anyone to use.
We use containers for everything on our HPC clusters - R&D, official benchmarks, etc.
Containers give us portable software stacks without sacrificing performance.
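A minimal sketch of pulling up one of these monthly framework images; the tag follows the YY.MM scheme mentioned above, and the mount path is an assumption. The command is echoed rather than executed, since it needs Docker and registry access.

```shell
#!/bin/sh
# Sketch: running an NGC framework container (tag is an assumption).
IMAGE="nvcr.io/nvidia/pytorch:19.09-py3"

# Echoed rather than executed: requires docker and an NGC pull.
echo "docker run --gpus all --rm -it --ipc=host -v \$PWD:/workspace $IMAGE"
```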
From an infra perspective...
• Slurm: User job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks
Scale to multiple nodes: software stack (system)
Login nodes | DGX POD: DGX servers with DGX Base OS
Slurm controller | Enroot | Docker | Pyxis
NGC model containers (PyTorch, TensorFlow from 19.09)
DCGM
Example: SLURM+Docker+MPI
Excerpts from an actual script used to launch jobs for the MLPerf v0.5 benchmark (208 LOC total):
1. Set up docker flags
2. Set up mpirun flags
3. Set up SSH
4. Start sleep containers
5. Launch mpirun in the rank-0 container
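The five steps above can be sketched as follows; the flag values and names (image:tag, run_benchmark.sh) are illustrative assumptions, not the actual MLPerf script. The cluster commands are echoed rather than executed.

```shell
#!/bin/sh
# Sketch of the SLURM+Docker+MPI launch steps (values are assumptions).

# 1. Docker flags: share host namespaces so MPI ranks can reach each other.
DOCKER_FLAGS="--rm --net=host --uts=host --ipc=host --pid=host"

# 2. mpirun flags: one rank per GPU on every allocated node (16 per DGX-2).
NP=$(( ${SLURM_JOB_NUM_NODES:-4} * 16 ))
MPIRUN_FLAGS="-np $NP"

# 3. SSH setup between containers (key generation/distribution) goes here.
# 4. Start an idle "sleep" container on every node for mpirun to attach to:
echo "srun docker run $DOCKER_FLAGS image:tag sleep infinity"
# 5. From the rank-0 container, launch the job:
echo "mpirun $MPIRUN_FLAGS ./run_benchmark.sh"
```

This is the complexity that Enroot and Pyxis, described next, are designed to remove.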
Containers at NVIDIA
What we need
● High performance
● Unprivileged runtime
● Uses docker image format
What we want
● Preserve SLURM cgroups
● NVIDIA+Mellanox devices are available by default
● MPI between containers is easy
● Can install packages inside containers
ENROOT: Improved Linux utils
enroot-unshare : like unshare(1), creates new namespaces
enroot-mount : like mount(8), mounts filesystems
enroot-switchroot : like switch_root(8), changes rootfs
enroot-aufs2ovlfs : converts AUFS whiteouts to OverlayFS
enroot-mksquashovlfs : like mksquashfs(1) on top of OverlayFS
http://github.com/nvidia/enroot
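On top of these utilities, Enroot exposes a small import/create/start workflow. A minimal sketch, with the image and sandbox names as assumptions; the commands are echoed rather than executed since they need Enroot and registry access. Everything runs unprivileged, with no daemon.

```shell
#!/bin/sh
# Sketch: typical Enroot workflow (image/sandbox names are assumptions).
IMAGE="nvcr.io/nvidia/pytorch:19.09-py3"
IMPORT_CMD="enroot import docker://$IMAGE"

echo "$IMPORT_CMD"                                  # docker image -> squashfs
echo "enroot create --name pytorch ./pytorch.sqsh"  # unpack into a sandbox
echo "enroot start pytorch"                         # enter and run, unprivileged
```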
Examples: Pyxis, MPI workload
1. No need to pass through environment variables (Pyxis inherits them all)
2. No need for any of these docker args: --rm --net=host --uts=host --ipc=host --pid=host
3. No need to configure mpirun (SLURM handles it)
4. No need to set up SSH (PMIx doesn’t use it)
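With Pyxis, the whole MPI workload collapses to a single srun line; --container-image is the flag Pyxis adds to srun, while the node/task counts and image tag here are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: the same MPI workload launched through Slurm with Pyxis.
SRUN_CMD="srun --mpi=pmix -N 4 --ntasks-per-node=16 \
--container-image=nvcr.io/nvidia/pytorch:19.09-py3 python train.py"

echo "$SRUN_CMD"   # no docker flags, no mpirun config, no SSH setup
```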
Integrating clusters in the development workflow
• Integrating DL-friendly tools like GitLab, Docker w/ HPC systems
Kick off 10,000s of GPU hours of tests with a single button click in GitLab
… build and package with Docker … schedule and prioritize with SLURM … on demand or on a schedule … reporting via GitLab, ELK stack, Slack, email
Emphasis on keeping things simple for users while hiding integration complexity
Ensure reproducibility and rapid triage
Supercomputer-scale CI (internal continuous integration at NVIDIA)
Artifactory
GitLab
Docker
SLURM
ELK
In conclusion...
• Scaling requires careful consideration of algorithms and infrastructure at each step
• Optimized single-GPU model
• Efficient & scalable Allreduce library
• GPU interconnect, networking, storage
• The NVIDIA platform makes scaling easier and more efficient
• Deep Learning Examples with SOTA accuracy and performance
• NVIDIA NGC Container with optimized multi-GPU/multi-node software stack
• Accelerated compute platform designed for performance and scaling
A powerful computing infrastructure is essential to research
Scaling is important and we’re here to help