
Tesla Cluster Monitoring & Management (GTC 2012)

Introductions

Robert Alexander

— CUDA Tools Software Engineer at NVIDIA

— Tesla Software Group

Overview

Management and Monitoring APIs

Kepler Tesla Power Management Features

Health and Diagnostics

Monitoring and Management

NVIDIA Display Driver

NVML — C API

nvidia-smi — Command line

pyNVML — Python API

nvidia::ml — Perl API

NVML Supported OSes

Windows:

— Windows 7

— Windows Server 2008 R2

— 64-bit only

Linux:

— All supported by driver

— 32-bit

— 64-bit

NVML Supported GPUs

NVIDIA Tesla Brand:

— All

NVIDIA Quadro Brand:

— Kepler: All

— Fermi: 4000, 5000, 6000, 7000, M2070-Q

NVIDIA VGX Brand:

— All

— Supported in the hypervisor

NVML Queries

• Board serial number, GPU UUID

• PCI Information

• GPU utilization, memory utilization, pstate

• GPU compute processes, PIDs

• Power draw, temperature, fan speeds

• Clocks

• ECC errors

• Events API
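
Most of these queries map directly onto NVML calls. A minimal sketch using the pyNVML bindings introduced later in this talk (attribute and constant names follow the pynvml module; the serial-number and ECC calls raise NVMLError on boards that do not support them):

# Hypothetical one-shot inventory query using pyNVML.
from pynvml import *

nvmlInit()
for i in range(nvmlDeviceGetCount()):
    h = nvmlDeviceGetHandleByIndex(i)
    print("%s  serial=%s  UUID=%s" % (nvmlDeviceGetName(h),
                                      nvmlDeviceGetSerial(h),
                                      nvmlDeviceGetUUID(h)))
    util = nvmlDeviceGetUtilizationRates(h)   # .gpu / .memory, in percent
    mem = nvmlDeviceGetMemoryInfo(h)          # .total / .used / .free, in bytes
    print("  util: gpu=%d%% mem=%d%%   memory used: %d MB" %
          (util.gpu, util.memory, mem.used // (1024 * 1024)))
    # Volatile ECC counters reset when the driver reloads
    dbe = nvmlDeviceGetTotalEccErrors(h, NVML_DOUBLE_BIT_ECC, NVML_VOLATILE_ECC)
    print("  volatile double-bit ECC errors: %d" % dbe)
nvmlShutdown()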

NVML Commands

• Enable or Disable ECC

• Change Compute mode

— Applies only to CUDA

— Default: multiple contexts

— Exclusive Process

— Exclusive Thread

— Prohibited

• Change persistence mode (Linux)

— Keeps NVIDIA driver loaded
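
The same commands are available programmatically; a hedged sketch via pyNVML (the set calls need root privileges, the ECC change only takes effect after a reboot, and the constant names follow the pynvml module). The equivalent nvidia-smi switches (-e, -c, -pm) do the same from the command line.

# Hypothetical management sketch with pyNVML; run as root.
from pynvml import *

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)

# Enable ECC (pending until the next reboot)
nvmlDeviceSetEccMode(h, NVML_FEATURE_ENABLED)

# Allow only one CUDA process on this GPU
nvmlDeviceSetComputeMode(h, NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)

# Keep the NVIDIA driver loaded between jobs (Linux only)
nvmlDeviceSetPersistenceMode(h, NVML_FEATURE_ENABLED)

nvmlShutdown()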

nvidia-smi

ralexander@ralexander-test:~> nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Wed May 16 11:24:16 2012

Driver Version : 295.54

Attached GPUs : 1

GPU 0000:02:00.0

Product Name : Tesla C2050

Display Mode : Disabled

Persistence Mode : Disabled

Driver Model

Current : N/A

Pending : N/A

Serial Number : xxxxxxxxxx

GPU UUID : GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

VBIOS Version : 70.00.23.00.02

Inforom Version

OEM Object : 1.0

ECC Object : 1.0

Power Management Object : N/A

PCI

Bus : 0x02

Device : 0x00

Domain : 0x0000

Device Id : 0x06D110DE

Bus Id : 0000:02:00.0

Sub System Id : 0x077110DE

GPU Link Info

PCIe Generation

Max : 2

Current : 2

Link Width

Max : 16x

Current : 16x

Fan Speed : 30 %

Performance State : P0

Memory Usage

Total : 2687 MB

Used : 6 MB

Free : 2681 MB

Compute Mode : Default

Utilization

Gpu : 0 %

Memory : 0 %

Ecc Mode

Current : Enabled

Pending : Enabled

ECC Errors

Volatile

Single Bit

Device Memory : 0

Register File : 0

L1 Cache : 0

L2 Cache : 0

Total : 0

Double Bit

Device Memory : 0

Register File : 0

L1 Cache : 0

L2 Cache : 0

Total : 0

Aggregate

Single Bit

Device Memory : N/A

Register File : N/A

L1 Cache : N/A

L2 Cache : N/A

Total : 0

Double Bit

Device Memory : N/A

Register File : N/A

L1 Cache : N/A

L2 Cache : N/A

Total : 0

Temperature

Gpu : 56 C

Power Readings

Power Management : N/A

Power Draw : N/A

Power Limit : N/A

Clocks

Graphics : 573 MHz

SM : 1147 MHz

Memory : 1494 MHz

Max Clocks

Graphics : 573 MHz

SM : 1147 MHz

Memory : 1500 MHz

Compute Processes : None

nvidia-smi - XML

ralexander@ralexander-test:~> nvidia-smi -q -x

<?xml version="1.0" ?>

<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v3.dtd">

<nvidia_smi_log>

<timestamp>Wed May 16 11:33:28 2012</timestamp>

<driver_version>295.54</driver_version>

<attached_gpus>1</attached_gpus>

<gpu id="0000:02:00.0">

<product_name>Tesla C2050</product_name>

<display_mode>Disabled</display_mode>

<persistence_mode>Disabled</persistence_mode>

<driver_model>

<current_dm>N/A</current_dm>

<pending_dm>N/A</pending_dm>
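
The XML form is the easiest to consume from scripts. A small sketch using only the Python standard library and the element names visible in the excerpt above (any other field names would need to be checked against the nvsmi DTD):

# Hypothetical parser for `nvidia-smi -q -x` output.
import subprocess
import xml.etree.ElementTree as ET

xml_text = subprocess.check_output(["nvidia-smi", "-q", "-x"])
root = ET.fromstring(xml_text)

print("driver %s, %s GPU(s) attached" %
      (root.findtext("driver_version"), root.findtext("attached_gpus")))

for gpu in root.findall("gpu"):
    print("%s  %s  persistence=%s" %
          (gpu.get("id"),
           gpu.findtext("product_name"),
           gpu.findtext("persistence_mode")))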

pyNVML Example

ralexander@ralexander-test $ python

>>> from pynvml import *

>>> nvmlInit()

>>> count = nvmlDeviceGetCount()

>>> for index in range(count):

...     h = nvmlDeviceGetHandleByIndex(index)

...     print nvmlDeviceGetName(h)

Tesla C2075

Tesla C2075

pyNVML Example Continued

>>> gpu = nvmlDeviceGetHandleByIndex(0)

>>> print nvmlDeviceGetClockInfo(gpu, NVML_CLOCK_SM)

101

>>> print nvmlDeviceGetMaxClockInfo(gpu, NVML_CLOCK_SM)

1147

>>> print nvmlDeviceGetPowerUsage(gpu)

31899

>>> nvmlShutdown()

Clocks are reported in megahertz; power usage is reported in milliwatts.

Other clock domains: NVML_CLOCK_GRAPHICS, NVML_CLOCK_MEM
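
The same calls can be polled for continuous monitoring; a minimal sketch built on the clock, temperature and utilization queries shown above (the one-second interval is arbitrary, and unsupported queries raise NVMLError on some boards):

# Hypothetical polling loop built on the pyNVML calls shown above.
import time
from pynvml import *

nvmlInit()
handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]
try:
    while True:
        for i, h in enumerate(handles):
            util = nvmlDeviceGetUtilizationRates(h)
            temp = nvmlDeviceGetTemperature(h, NVML_TEMPERATURE_GPU)
            sm_mhz = nvmlDeviceGetClockInfo(h, NVML_CLOCK_SM)
            print("GPU%d  util=%3d%%  temp=%dC  SM=%d MHz" %
                  (i, util.gpu, temp, sm_mhz))
        time.sleep(1)
finally:
    nvmlShutdown()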

Third Party Software

Adaptive Computing

Bright Computing

Ganglia

Penguin Computing

Platform Computing

Univa

Ganglia – NVML plug-in

Data from http://www.ncsa.illinois.edu/

Out of Band API

Why Out of band?

— Doesn’t use CPU or operating system

— Lights out management

— Minimize performance jitter

Subset of in-band functionality

— ECC

— Power Draw

— Temperature

— Static info – Serial number, UUID

Out of Band API

Requires system vendor integration

BMC can control and monitor GPU

— Control system fans based on GPU temperature

IPMI may be exposed

Kepler Power Management

New Kepler Only APIs

Set Power Limit

Set Fixed Maximum Clocks

Query Performance Limiting Factors

Set Power Limit

Limit the amount of power the GPU can consume

Exposed in NVML and nvidia-smi

Set power budgets and power policies
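
A hedged sketch of setting the limit through pyNVML on a Kepler-class Tesla (the value is in milliwatts, the call needs root, it raises NVMLError on earlier GPUs, and the 150 W budget is only an example):

# Hypothetical power-capping sketch for a Kepler-class Tesla GPU; run as root.
from pynvml import *

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)

print("current limit: %d mW" % nvmlDeviceGetPowerManagementLimit(h))

# Cap the board at 150 W (value in milliwatts); example budget only.
nvmlDeviceSetPowerManagementLimit(h, 150000)

nvmlShutdown()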

Set Fixed Maximum Clocks

From a set of supported clocks

Will be overridden by over-power or over-thermal events

Fixed performance when multiple GPUs operate in lockstep

— Equivalent Performance

— Reliable Performance

— Save Power
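
A sketch of pinning clocks from the supported set via pyNVML (Kepler only; the call needs root, and choosing the highest supported pair is just one possible policy):

# Hypothetical fixed-clock sketch for a Kepler-class Tesla GPU; run as root.
from pynvml import *

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)

# Pick the highest supported memory clock and the highest graphics
# clock that is valid for it, then pin the GPU to that pair.
mem_mhz = max(nvmlDeviceGetSupportedMemoryClocks(h))
gfx_mhz = max(nvmlDeviceGetSupportedGraphicsClocks(h, mem_mhz))
nvmlDeviceSetApplicationsClocks(h, mem_mhz, gfx_mhz)

nvmlShutdown()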

Lockstep with Dynamic Performance

[Chart, shown across four slides: performance of GPU0, GPU1, GPU2 and GPU3 over time, including a thermal event on one GPU]

Query Performance Limiting Factors

GPU clocks will adjust based on the environment

When thermal or power limits are exceeded, the GPU will reduce performance
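
A sketch of querying the current limiting factors; the bitmask call and reason constants below follow the pynvml module (nvmlDeviceGetCurrentClocksThrottleReasons) and may not exist in older versions of the bindings:

# Hypothetical check of the current performance-limiting factors.
from pynvml import *

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)

reasons = nvmlDeviceGetCurrentClocksThrottleReasons(h)  # bitmask
if reasons & nvmlClocksThrottleReasonSwPowerCap:
    print("clocks reduced to stay under the power limit")
if reasons & nvmlClocksThrottleReasonHwSlowdown:
    print("hardware slowdown (thermal or power brake) active")
if reasons & nvmlClocksThrottleReasonGpuIdle:
    print("GPU is idle")

nvmlShutdown()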

Health and Diagnostics

nvidia-healthmon

—Quick health check

—Suggest remedies to SW and system configuration problems

—Help users help themselves

nvidia-healthmon

What it’s not

—Not a full HW diagnostic

—Not comprehensive

nvidia-healthmon – Feature Set

Basic CUDA and NVML sanity check

Diagnosis of GPU failure-to-initialize problems

Check for conflicting drivers (e.g. VESA)

InfoROM validation

Poorly seated GPU detection

Check for disconnected power cables

ECC error detection and reporting

Bandwidth test

nvidia-healthmon – Use Cases

Cluster scheduler’s prologue / epilogue script

Health and diagnostic suites

— Designed to integrate into third party tools

After provisioning cluster node

Run directly, manually
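
In a scheduler prologue the simplest integration is to gate on the exit status; a minimal wrapper sketch (it assumes nvidia-healthmon is on PATH and exits nonzero when a test fails):

# Hypothetical prologue wrapper: refuse the node if nvidia-healthmon fails.
import subprocess
import sys

# Assumes nvidia-healthmon is on PATH and exits nonzero when a test fails.
rc = subprocess.call(["nvidia-healthmon"])
if rc != 0:
    sys.stderr.write("nvidia-healthmon reported a problem (exit %d)\n" % rc)
    sys.exit(rc)   # let the scheduler take this node offline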

nvidia-healthmon

ralexander@ralexander-test:~> ./nvidia-healthmon

Loading Config: SUCCESS

Global Tests:

NVML Sanity: SUCCESS

Tesla Devices Count: SKIPPED

Result: 1 success, 0 errors, 0 warnings, 1 did not run

-----------------------------------------------------------

GPU 0000:02:00.0 #0 : Tesla C2050 (Serial: xxxxxxxxxxx):

NVML Sanity: SUCCESS

InfoROM: SKIPPED

GEMENI InfoROM: SKIPPED

ECC: SUCCESS

CUDA Sanity: SUCCESS

PCIe Maximum Link Generation: SKIPPED

PCIe Maximum Link Width: SKIPPED

PCI Seating: SUCCESS

PCI Bandwidth: SKIPPED

Result: 4 success, 0 errors, 0 warnings, 5 did not run

5 success, 0 errors, 0 warnings, 6 did not run

WARNING: One or more tests didn't run.

Thanks!

Questions?

Tesla Cluster Monitoring & Management
