Dale Southard BEST PRACTICES FOR DESIGNING, DEPLOYING, AND MANAGING GPU CLUSTERS

Post on 21-Jun-2020

TRANSCRIPT

Page 1: Best Practices for Designing, Deploying, and Managing GPU ... · TESLA ACCELERATED COMPUTING PLATFORM Pick the HW mix that’s right for your site Topology Matters in design and in

Dale Southard

BEST PRACTICES FOR DESIGNING, DEPLOYING, AND MANAGING GPU CLUSTERS

Cluster Design

Tesla Accelerated Computing Platform

[Diagram: the platform stack, split into Development and Data Center Infrastructure.]

Development: Programming Languages; Libraries (cuBLAS); Development Tools; Profile and Debug (CUDA Debugging API); Software Solutions

Data Center Infrastructure: Infrastructure Management; Communication; System Solutions; GPU Accelerators (GPU Boost); Interconnect (GPU Direct, NVLink); System Management (NVML); Compiler Solutions (LLVM)

TESLA: PLATFORM WITH OPEN ECOSYSTEM

Pick the CPU that’s Correct for You

[Diagram: x86 host CPU; three approaches to GPU programming: Libraries (AmgX, cuBLAS), Programming Languages, and Compiler Directives.]

DESIGNING FOR WORKLOAD

Multiple Ways to Balance Parallel and Serial Performance

[Diagram: three node layouts with 2, 3, and 4 GPUs per CPU. GPUs are linked by NVLink and attach to the CPU over PCIe Gen3 x16; one layout uses a PCIe switch. Link bandwidths labeled in the figure: 80 GB/s, 40 GB/s, 20 GB/s, and 16 GB/s.]

IN SITU VIS – FASTER SCIENCE, LOWER COST

Traditional Workflow

CPU Supercomputer (simulation: 1 week), then Viz Cluster (viz: 1 day)

Multiple iterations

Time to Discovery = Months

Tesla Platform

GPU-Accelerated Supercomputer: visualize while you simulate

Restart simulation instantly

Multiple iterations

Time to Discovery = Weeks

Faster time to discovery, reduced pressure on filestore

Cluster Deployment

NODE TOPOLOGY CIRCA 2008

Life Was Good

[Diagram: two CPUs and their memory connected through a single bridge, with the GPUs attached to the bridge.]

NODE TOPOLOGY IN 2014 AND BEYOND

Better, but More Choices

[Diagram: two CPUs, each with its own memory and its own attached GPUs; every GPU has its own memory, and the number of GPUs per CPU can grow.]

TOPOLOGY-AWARE RESOURCE MANAGERS

$CUDA_VISIBLE_DEVICES today, cgroups coming soon

[Diagram: a two-CPU node with four GPUs; Jobs 1, 2, and 3 are each assigned their own subset of the GPUs.]
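A minimal sketch of how a resource manager might restrict each job to its assigned GPUs through CUDA_VISIBLE_DEVICES. The helper name job_environment and the GPU-to-socket placement are hypothetical and only illustrate the idea; a real scheduler would derive the placement from the node topology.

```python
import os


def job_environment(assigned_gpus, base_env=None):
    """Return a copy of the environment that exposes only the assigned GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA renumbers the visible devices starting from 0 inside the job,
    # so the application itself does not need to know the physical indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in assigned_gpus)
    return env


# Hypothetical placement on a two-socket node:
# GPUs 0-1 sit under CPU 0, GPUs 2-3 sit under CPU 1.
job1_env = job_environment([0, 1], base_env={})
job2_env = job_environment([2], base_env={})
job3_env = job_environment([3], base_env={})

print(job1_env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the env argument when the job is launched, for example with subprocess.run.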

Cluster Management

TESLA GPU DEPLOYMENT KIT

nvidia-smi provides a command-line interface

NVML provides API access

https://developer.nvidia.com/gpu-deployment-kit

Command-line and Library

COMPUTE MODE

The Compute Mode setting controls simultaneous use of a GPU:

DEFAULT: multiple processes may use the GPU simultaneously

EXCLUSIVE_THREAD: only one context is allowed

EXCLUSIVE_PROCESS: one process at a time, but multiple threads within it

PROHIBITED: no compute contexts may be created

Can be set from the command line (nvidia-smi) or through the API (NVML)

Insurance Against Scheduling Issues
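As a rough illustration of the command-line route, the sketch below builds an nvidia-smi invocation that sets the compute mode using the tool's -c (compute mode) and -i (GPU index) options. The helper compute_mode_command is hypothetical, and changing the mode typically requires administrator privileges.

```python
import shlex

# Mode names as listed on the slide.
COMPUTE_MODES = ("DEFAULT", "EXCLUSIVE_THREAD", "EXCLUSIVE_PROCESS", "PROHIBITED")


def compute_mode_command(mode, gpu_index=None):
    """Build the nvidia-smi argument list that sets the given compute mode."""
    if mode not in COMPUTE_MODES:
        raise ValueError(f"unknown compute mode: {mode!r}")
    argv = ["nvidia-smi"]
    if gpu_index is not None:
        # Without -i the setting applies to every GPU in the node.
        argv += ["-i", str(gpu_index)]
    argv += ["-c", mode]
    return argv


# Only builds and prints the command; nothing is executed here.
print(shlex.join(compute_mode_command("EXCLUSIVE_PROCESS", gpu_index=0)))
```

A scheduler prologue could run the resulting argument list with subprocess.run before handing the node to a job.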

RESOURCE LIMITS

UVA depends on allocating virtual address space

Virtual address space != physical RAM consumption, so a ulimit on virtual memory can block GPU applications that use very little physical RAM

Several batch systems and resource managers support cgroups, either directly or via plugins.

Use cgroups, Not ulimit
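The gap between virtual address space and physical RAM can be observed on Linux without a GPU. The sketch below is an analogy, not a UVA measurement: it reserves address space with an anonymous mmap and compares VmSize (virtual) with VmRSS (resident) from /proc/self/status. The helper names are hypothetical.

```python
import mmap
import os


def vm_sizes_kb(status_text):
    """Extract VmSize (virtual) and VmRSS (resident) in kB from /proc/<pid>/status text."""
    sizes = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            sizes[key] = int(rest.split()[0])
    return sizes


def read_self_sizes():
    """Read the current process's sizes (Linux only)."""
    with open("/proc/self/status") as f:
        return vm_sizes_kb(f.read())


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    before = read_self_sizes()
    try:
        # Reserve 256 MiB of address space but never touch the pages.
        region = mmap.mmap(-1, 1 << 28)
        after = read_self_sizes()
        region.close()
        print("virtual growth (kB): ", after["VmSize"] - before["VmSize"])
        print("resident growth (kB):", after["VmRSS"] - before["VmRSS"])
    except OSError as exc:
        # A virtual-memory limit can refuse the reservation even though
        # almost no physical RAM would have been consumed.
        print("reservation refused:", exc)
```

A limit on virtual memory counts the whole reservation, while a cgroup memory limit counts only the pages actually in use, which is why the slide recommends cgroups.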

MONITORING

Environmental and utilization metrics are available

Tesla cards may provide out-of-band (OoB) access to telemetry via the BMC

NVML support has been integrated into many monitoring systems

Open Source, Commercial, DIY
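For the DIY route, one possible approach is to poll nvidia-smi with --query-gpu and parse the CSV it prints. The sketch below assumes the fields index, temperature.gpu, utilization.gpu, and power.draw; the sample output and the 40-degree threshold are made up for illustration.

```python
QUERY_FIELDS = ("index", "temperature.gpu", "utilization.gpu", "power.draw")


def query_command(fields=QUERY_FIELDS):
    """Build an nvidia-smi call that prints one bare CSV line per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_gpu_metrics(output, fields=QUERY_FIELDS):
    """Turn the CSV text into one dict per GPU; non-numeric values become None."""
    samples = []
    for line in output.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        sample = {}
        for name, value in zip(fields, values):
            try:
                sample[name] = float(value)
            except ValueError:
                # Metrics a board does not report show up as text, not numbers.
                sample[name] = None
        samples.append(sample)
    return samples


# Made-up sample output standing in for a real two-GPU node.
sample_output = "0, 45, 87, 120.50\n1, 38, 0, [N/A]\n"
metrics = parse_gpu_metrics(sample_output)
hot = [int(m["index"]) for m in metrics
       if m["temperature.gpu"] is not None and m["temperature.gpu"] > 40]
print(hot)
```

On a real node the text would come from running query_command() with subprocess.run and capture_output=True.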

GPU ACCOUNTING

Per-process accounting of GPU usage by PID

Accessible by library (NVML) or command-line (nvidia-smi)

Enable accounting mode:

sudo nvidia-smi -am 1

Human-readable accounting:

nvidia-smi -q -d ACCOUNTING

CSV accounting data:

nvidia-smi --query-accounted-apps=gpu_name,gpu_util --format=csv
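The CSV form is convenient for feeding a site's own accounting scripts. The sketch below parses such output into per-PID utilization; it assumes pid was added to the queried fields, and the sample text, including the exact header spelling, is illustrative rather than captured from a real system.

```python
import csv
import io


def parse_accounting_csv(text):
    """Parse nvidia-smi accounting CSV text into a list of row dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [h.strip() for h in next(reader)]
    return [dict(zip(header, (c.strip() for c in row))) for row in reader if row]


def utilization_by_pid(rows):
    """Map each PID to its GPU utilization percentage."""
    usage = {}
    for row in rows:
        # Match the column by prefix because the header may carry a unit suffix.
        util_key = next(k for k in row if k.startswith("gpu_util"))
        usage[int(row["pid"])] = int(row[util_key].rstrip(" %"))
    return usage


# Illustrative sample; real headers and values may differ by driver version.
sample = (
    "gpu_name, pid, gpu_utilization [%]\n"
    "Tesla K40m, 4321, 57 %\n"
    "Tesla K40m, 4410, 12 %\n"
)
usage = utilization_by_pid(parse_accounting_csv(sample))
print(usage)
```

Joining these PIDs against the batch system's job records would attribute GPU usage to individual jobs.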

Summary

TESLA ACCELERATED COMPUTING PLATFORM

Pick the HW mix that’s right for your site

Topology Matters in design and in resource management

Rich ecosystem of management hooks and tools

Libraries, Languages, Tools

Thanks