![Page 2: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/2.jpg)
Introductions
Robert Alexander
— CUDA Tools Software Engineer at NVIDIA
— Tesla Software Group
![Page 3: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/3.jpg)
Overview
Management and Monitoring APIs
Kepler Tesla Power Management Features
Health and Diagnostics
![Page 4: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/4.jpg)
Monitoring and Management
NVIDIA Display Driver
NVML
C API
nvidia-smi
Command line
pyNVML
Python API
nvidia::ml
Perl API
![Page 5: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/5.jpg)
NVML Supported OSes
Windows:
— Windows 7
— Windows Server 2008 R2
— 64-bit only
Linux:
— All supported by driver
— 32-bit
— 64-bit
![Page 6: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/6.jpg)
NVML Supported GPUs
NVIDIA Tesla Brand:
— All
NVIDIA Quadro Brand:
— Kepler: All
— Fermi: 4000, 5000, 6000, 7000, M2070-Q
NVIDIA VGX Brand:
— All
— Supported in the hypervisor
![Page 7: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/7.jpg)
NVML Queries
• Board serial number, GPU UUID
• PCI Information
• GPU utilization, memory utilization, pstate
• GPU compute processes, PIDs
• Power draw, temperature, fan speeds
• Clocks
• ECC errors
• Events API
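As a rough illustration of the queries listed above, the sketch below gathers a few of them through the pyNVML bindings. The helper name `gpu_snapshot` and the dictionary keys are made up for this example; the `nvmlDevice…` calls are the ones exposed by the `pynvml` module. The module is passed in as a parameter (an assumption made purely so the helper can be exercised without a GPU), and any query the board does not support is returned as `None`, mirroring the `N/A` fields in the nvidia-smi output.

```python
def gpu_snapshot(nvml, index):
    """Collect a handful of NVML queries for one GPU.

    `nvml` is the imported pynvml module; nvmlInit() must already have
    been called. `gpu_snapshot` is a hypothetical helper, not part of NVML.
    """
    handle = nvml.nvmlDeviceGetHandleByIndex(index)

    def query(func, *args):
        # Unsupported queries raise NVMLError; report those as None.
        try:
            return func(handle, *args)
        except nvml.NVMLError:
            return None

    util = query(nvml.nvmlDeviceGetUtilizationRates)
    return {
        "name": query(nvml.nvmlDeviceGetName),
        "uuid": query(nvml.nvmlDeviceGetUUID),
        "serial": query(nvml.nvmlDeviceGetSerial),
        "temperature_c": query(nvml.nvmlDeviceGetTemperature,
                               nvml.NVML_TEMPERATURE_GPU),
        "fan_percent": query(nvml.nvmlDeviceGetFanSpeed),
        "power_mw": query(nvml.nvmlDeviceGetPowerUsage),  # milliwatts
        "gpu_util_percent": util.gpu if util is not None else None,
        "mem_util_percent": util.memory if util is not None else None,
    }
```

On a machine with the driver installed, this would typically be called as `gpu_snapshot(pynvml, 0)` between `pynvml.nvmlInit()` and `pynvml.nvmlShutdown()`.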
![Page 8: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/8.jpg)
NVML Commands
• Enable or Disable ECC
• Change Compute mode
• Applies only to CUDA
• Default – multiple contexts
• Exclusive Process
• Exclusive Thread
• Prohibited
• Change persistence mode (Linux)
• Keeps NVIDIA driver loaded
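The commands above map onto setter calls in the pyNVML bindings. The following is a minimal sketch, assuming the `pynvml` module is passed in as `nvml` and that the caller has the privileges these calls normally require. The helper names and the lowercase mode strings are invented for this example; the `NVML_COMPUTEMODE_*` and `NVML_FEATURE_*` constants and the three `nvmlDeviceSet…` functions come from pyNVML.

```python
# Hypothetical short names mapped to pyNVML compute-mode constant names.
COMPUTE_MODES = {
    "default": "NVML_COMPUTEMODE_DEFAULT",
    "exclusive_thread": "NVML_COMPUTEMODE_EXCLUSIVE_THREAD",
    "exclusive_process": "NVML_COMPUTEMODE_EXCLUSIVE_PROCESS",
    "prohibited": "NVML_COMPUTEMODE_PROHIBITED",
}


def set_compute_mode(nvml, handle, mode):
    """Change the CUDA compute mode of one GPU; returns the constant used."""
    if mode not in COMPUTE_MODES:
        raise ValueError("unknown compute mode: %r" % (mode,))
    constant = getattr(nvml, COMPUTE_MODES[mode])
    nvml.nvmlDeviceSetComputeMode(handle, constant)
    return constant


def _feature(nvml, enabled):
    # NVML expresses on/off settings as FEATURE_ENABLED / FEATURE_DISABLED.
    return nvml.NVML_FEATURE_ENABLED if enabled else nvml.NVML_FEATURE_DISABLED


def set_persistence_mode(nvml, handle, enabled):
    """Linux only: keep the NVIDIA driver loaded when no client is attached."""
    nvml.nvmlDeviceSetPersistenceMode(handle, _feature(nvml, enabled))


def set_ecc_mode(nvml, handle, enabled):
    """Request an ECC change; nvidia-smi reports it under Ecc Mode, Pending."""
    nvml.nvmlDeviceSetEccMode(handle, _feature(nvml, enabled))
```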
![Page 9: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/9.jpg)
nvidia-smi
ralexander@ralexander-test:~> nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Wed May 16 11:24:16 2012
Driver Version : 295.54
Attached GPUs : 1
GPU 0000:02:00.0
Product Name : Tesla C2050
Display Mode : Disabled
Persistence Mode : Disabled
![Page 10: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/10.jpg)
nvidia-smi
Driver Model
Current : N/A
Pending : N/A
Serial Number : xxxxxxxxxx
GPU UUID : GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
VBIOS Version : 70.00.23.00.02
Inforom Version
OEM Object : 1.0
ECC Object : 1.0
Power Management Object : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x06D110DE
![Page 11: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/11.jpg)
nvidia-smi
Bus Id : 0000:02:00.0
Sub System Id : 0x077110DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Fan Speed : 30 %
Performance State : P0
Memory Usage
Total : 2687 MB
Used : 6 MB
Free : 2681 MB
![Page 12: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/12.jpg)
nvidia-smi
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
![Page 13: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/13.jpg)
nvidia-smi
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : 0
![Page 14: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/14.jpg)
nvidia-smi
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Total : 0
Temperature
Gpu : 56 C
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
![Page 15: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/15.jpg)
nvidia-smi
Clocks
Graphics : 573 MHz
SM : 1147 MHz
Memory : 1494 MHz
Max Clocks
Graphics : 573 MHz
SM : 1147 MHz
Memory : 1500 MHz
Compute Processes : None
![Page 16: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/16.jpg)
nvidia-smi - XML
ralexander@ralexander-test:~> nvidia-smi -q -x
<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v3.dtd">
<nvidia_smi_log>
<timestamp>Wed May 16 11:33:28 2012</timestamp>
<driver_version>295.54</driver_version>
<attached_gpus>1</attached_gpus>
<gpu id="0000:02:00.0">
<product_name>Tesla C2050</product_name>
<display_mode>Disabled</display_mode>
<persistence_mode>Disabled</persistence_mode>
<driver_model>
<current_dm>N/A</current_dm>
<pending_dm>N/A</pending_dm>
…
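The XML form is convenient for scripts. Below is a small sketch that parses it with Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is the fragment shown on the slide with the elided portion dropped and closing tags added so that it parses; real output contains many more elements. The function name `summarize` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide, closed off so it is well-formed (an assumption:
# the elided elements are simply omitted here).
SAMPLE = """<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v3.dtd">
<nvidia_smi_log>
  <timestamp>Wed May 16 11:33:28 2012</timestamp>
  <driver_version>295.54</driver_version>
  <attached_gpus>1</attached_gpus>
  <gpu id="0000:02:00.0">
    <product_name>Tesla C2050</product_name>
    <display_mode>Disabled</display_mode>
    <persistence_mode>Disabled</persistence_mode>
  </gpu>
</nvidia_smi_log>
"""


def summarize(xml_text):
    """Pull the driver version and per-GPU basics out of `nvidia-smi -q -x`."""
    root = ET.fromstring(xml_text)
    return {
        "driver_version": root.findtext("driver_version"),
        "gpus": [
            {
                "id": gpu.get("id"),
                "product_name": gpu.findtext("product_name"),
                "persistence_mode": gpu.findtext("persistence_mode"),
            }
            for gpu in root.findall("gpu")
        ],
    }
```

In practice the input would come from running `nvidia-smi -q -x` (for instance via `subprocess`) rather than from a literal string.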
![Page 17: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/17.jpg)
pyNVML Example
ralexander@ralexander-test $ python
>>> from pynvml import *
>>> nvmlInit()
>>> count = nvmlDeviceGetCount()
>>> for index in range(count):
...     h = nvmlDeviceGetHandleByIndex(index)
...     print(nvmlDeviceGetName(h))
...
Tesla C2075
Tesla C2075
![Page 18: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/18.jpg)
pyNVML Example Continued
>>> gpu = nvmlDeviceGetHandleByIndex(0)
>>> print(nvmlDeviceGetClockInfo(gpu, NVML_CLOCK_SM))  # in megahertz
101
>>> print(nvmlDeviceGetMaxClockInfo(gpu, NVML_CLOCK_SM))  # in megahertz
1147
>>> print(nvmlDeviceGetPowerUsage(gpu))  # in milliwatts
31899
>>> nvmlShutdown()
Also: NVML_CLOCK_GRAPHICS, NVML_CLOCK_MEM
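In a script, as opposed to an interactive session, it is easy to forget the closing `nvmlShutdown()` when an exception occurs. One possible pattern, sketched here with the invented names `nvml_session` and `milliwatts_to_watts`, wraps the init/shutdown pair in a context manager and converts the milliwatt reading to watts. The `pynvml` module is passed in as `nvml`, an assumption made so the sketch can be tried without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def nvml_session(nvml):
    """Call nvmlInit() on entry and nvmlShutdown() on exit, even on error."""
    nvml.nvmlInit()
    try:
        yield nvml
    finally:
        nvml.nvmlShutdown()


def milliwatts_to_watts(milliwatts):
    """nvmlDeviceGetPowerUsage reports milliwatts; convert to watts."""
    return milliwatts / 1000.0
```

With the bindings installed, usage would look like `with nvml_session(pynvml) as nvml: ...`; the reading of 31899 from the example above corresponds to roughly 31.9 W.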
![Page 19: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/19.jpg)
Downloads
NVML SDK
— http://developer.nvidia.com/tesla-deployment-kit
Python NVML Bindings
— http://pypi.python.org/pypi/nvidia-ml-py/
Perl NVML Bindings
— http://search.cpan.org/~nvbinding/nvidia-ml-pl/
![Page 20: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/20.jpg)
Third Party Software
Adaptive Computing
Bright Computing
Platform Computing
Ganglia
Penguin Computing
Univa
![Page 21: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/21.jpg)
Ganglia – NVML plug-in
Data from http://www.ncsa.illinois.edu/
![Page 22: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/22.jpg)
Out of Band API
Why Out of band?
— Doesn’t use CPU or operating system
— Lights out management
— Minimize performance jitter
Subset of in band functionality
— ECC
— Power Draw
— Temperature
— Static info – Serial number, UUID
![Page 23: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/23.jpg)
Out of Band API
Requires system vendor integration
BMC can control and monitor GPU
— Control system fans based on GPU temperature
IPMI may be exposed
![Page 24: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/24.jpg)
Kepler Power Management
New Kepler Only APIs
Set Power Limit
Set Fixed Maximum Clocks
Query Performance Limiting Factors
![Page 25: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/25.jpg)
Set Power Limit
Limit the amount of power the GPU can consume
Exposed in NVML and nvidia-smi
Set power budgets and power policies
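The slide does not name the NVML entry points. The sketch below assumes the power-management-limit calls that the pyNVML bindings expose in later releases (`nvmlDeviceGetPowerManagementLimitConstraints` and `nvmlDeviceSetPowerManagementLimit`, both working in milliwatts). The helper name `set_power_limit` is invented, and the `pynvml` module is passed in as `nvml`. The request is clamped to the range the board reports before it is applied.

```python
def set_power_limit(nvml, handle, requested_mw):
    """Apply a power limit in milliwatts, clamped to the board's range.

    Returns the limit that was actually set. Assumes a Kepler-class GPU
    and sufficient privileges; older boards raise NVMLError instead.
    """
    low_mw, high_mw = nvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    limit_mw = max(low_mw, min(high_mw, requested_mw))
    nvml.nvmlDeviceSetPowerManagementLimit(handle, limit_mw)
    return limit_mw
```

A cluster-wide power budget could then be expressed as a loop over device handles that calls this helper with each GPU's share.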
![Page 26: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/26.jpg)
Set Fixed Maximum Clocks
From a set of supported clocks
Will be overridden by over power or over thermal events
Fixed performance when multiple GPUs operate in lock step
— Equivalent Performance
— Reliable Performance
— Save Power
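A possible way to drive this from Python, assuming the feature corresponds to the applications-clocks calls in the pyNVML bindings (`nvmlDeviceGetSupportedMemoryClocks`, `nvmlDeviceGetSupportedGraphicsClocks`, `nvmlDeviceSetApplicationsClocks`); the slide itself does not name them. The helper names are invented and the `pynvml` module is passed in as `nvml`. Because clocks must come from the supported set, the sketch picks the highest supported value that does not exceed each target.

```python
def pick_clock(supported_mhz, target_mhz):
    """Highest supported clock (MHz) that does not exceed the target."""
    candidates = [clock for clock in supported_mhz if clock <= target_mhz]
    if not candidates:
        raise ValueError("no supported clock at or below %d MHz" % target_mhz)
    return max(candidates)


def set_fixed_clocks(nvml, handle, mem_target_mhz, graphics_target_mhz):
    """Pin memory and graphics clocks to supported values; returns the pair."""
    mem_mhz = pick_clock(
        nvml.nvmlDeviceGetSupportedMemoryClocks(handle), mem_target_mhz)
    # Supported graphics clocks depend on the chosen memory clock.
    graphics_mhz = pick_clock(
        nvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_mhz),
        graphics_target_mhz)
    nvml.nvmlDeviceSetApplicationsClocks(handle, mem_mhz, graphics_mhz)
    return mem_mhz, graphics_mhz
```

Applying the same targets to every GPU in a job is what gives the equivalent, repeatable performance described above.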
![Page 27: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/27.jpg)
Lockstep with Dynamic Performance
[Figure: timelines for GPU0, GPU1, GPU2, GPU3 against a Time axis]
![Page 28: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/28.jpg)
Lockstep with Dynamic Performance
[Figure: timelines for GPU0, GPU1, GPU2, GPU3 against a Time axis, with a Thermal Event marked]
![Page 29: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/29.jpg)
Lockstep with Dynamic Performance
[Figure: timelines for GPU0, GPU1, GPU2, GPU3 against a Time axis]
![Page 30: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/30.jpg)
Lockstep with Dynamic Performance
[Figure: timelines for GPU0, GPU1, GPU2, GPU3 against a Time axis]
![Page 31: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/31.jpg)
Query Performance Limiting Factors
GPU clocks will adjust based on environment
When over its thermal or power limits, the GPU will reduce performance
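NVML reports the limiting factors as a bitmask. The sketch below decodes such a mask; the bit values are given as I understand them from `nvml.h` (the `nvmlClocksThrottleReason*` constants) and should be checked against the header shipped with your driver. The short names and the function `decode_throttle_reasons` are invented for this example. With pyNVML the mask would come from `nvmlDeviceGetCurrentClocksThrottleReasons(handle)`.

```python
# (bit, label) pairs; values assumed to mirror nvmlClocksThrottleReason*.
THROTTLE_REASONS = [
    (0x1, "gpu_idle"),             # nothing running, clocks dropped to idle
    (0x2, "applications_clocks"),  # limited by the fixed clocks setting
    (0x4, "sw_power_cap"),         # over the configured power limit
    (0x8, "hw_slowdown"),          # hardware slowdown, e.g. over thermal
]


def decode_throttle_reasons(mask):
    """Return the labels of all reasons whose bit is set in `mask`."""
    return [label for bit, label in THROTTLE_REASONS if mask & bit]
```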
![Page 32: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/32.jpg)
Health and Diagnostics
nvidia-healthmon
— Quick health check
— Suggest remedies to SW and system configuration problems
— Help users help themselves
![Page 33: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/33.jpg)
nvidia-healthmon
What it’s not
— Not a full HW diagnostic
— Not comprehensive
![Page 34: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/34.jpg)
nvidia-healthmon – Feature Set
Basic CUDA and NVML sanity check
Diagnosis of GPU failure-to-initialize problems
Check for conflicting drivers (e.g. VESA)
InfoROM validation
Poorly seated GPU detection
Check for disconnected power cables
ECC error detection and reporting
Bandwidth test
![Page 35: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/35.jpg)
nvidia-healthmon – Use Cases
Cluster scheduler’s prologue / epilogue script
Health and diagnostic suites
— Designed to integrate into third party tools
After provisioning cluster node
Run directly, manually
![Page 36: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/36.jpg)
nvidia-healthmon
ralexander@ralexander-test:~> ./nvidia-healthmon
Loading Config: SUCCESS
Global Tests:
NVML Sanity: SUCCESS
Tesla Devices Count: SKIPPED
Result: 1 success, 0 errors, 0 warnings, 1 did not run
-----------------------------------------------------------
GPU 0000:02:00.0 #0 : Tesla C2050 (Serial: xxxxxxxxxxx):
NVML Sanity: SUCCESS
![Page 37: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/37.jpg)
nvidia-healthmon
InfoROM: SKIPPED
GEMENI InfoROM: SKIPPED
ECC: SUCCESS
CUDA Sanity: SUCCESS
PCIe Maximum Link Generation: SKIPPED
PCIe Maximum Link Width: SKIPPED
PCI Seating: SUCCESS
PCI Bandwidth: SKIPPED
Result: 4 success, 0 errors, 0 warnings, 5 did not run
5 success, 0 errors, 0 warnings, 6 did not run
WARNING: One or more tests didn't run.
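For the scheduler prologue/epilogue use case, a script needs a pass/fail decision from this output. The sketch below parses the `Result:` summary lines with a regular expression based only on the format shown in the sample run. It assumes the last matching line is the overall total, as it is here. The function names are invented for this example.

```python
import re

# Matches e.g. "5 success, 0 errors, 0 warnings, 6 did not run".
RESULT_RE = re.compile(
    r"(\d+) success, (\d+) errors, (\d+) warnings, (\d+) did not run")


def overall_result(output):
    """Return the counts from the last summary line in healthmon output."""
    matches = RESULT_RE.findall(output)
    if not matches:
        raise ValueError("no result summary found")
    success, errors, warnings, skipped = (int(n) for n in matches[-1])
    return {"success": success, "errors": errors,
            "warnings": warnings, "did_not_run": skipped}


def node_is_healthy(output):
    """Treat a node as healthy when the overall summary reports no errors."""
    return overall_result(output)["errors"] == 0
```

A prologue script could capture the tool's standard output, call `node_is_healthy`, and take the node offline when it returns `False`. Whether skipped tests should also count as a failure is a site policy decision.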
![Page 38: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/38.jpg)
Contact Info and Links
Robert Alexander
developer.nvidia.com/nvidia-management-library-nvml
forums.nvidia.com
![Page 39: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/39.jpg)
Thanks!
![Page 40: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/40.jpg)
Questions?
![Page 41: Tesla Cluster Monitoring & Management - GTC 2012](https://reader035.vdocument.in/reader035/viewer/2022071601/613d3710736caf36b75aae78/html5/thumbnails/41.jpg)
Tesla Cluster Monitoring & Management