
Page 1: An Analysis of GPU Utilization Trends on the Keeneland

July 18, 2012

An Analysis of GPU Utilization Trends on the Keeneland Initial Delivery System Tabitha K Samuel, Stephen McNally, John Wynkoop National Institute for Computational Sciences

Page 2: An Analysis of GPU Utilization Trends on the Keeneland

The Keeneland Project

• 5-year Track 2D cooperative agreement awarded by the NSF
• Partners – Georgia Tech, Oak Ridge National Lab, National Institute for Computational Sciences, and the University of Tennessee
• The Keeneland Initial Delivery System (KIDS) is being used to develop programming tools and libraries for a GPGPU platform

Page 3: An Analysis of GPU Utilization Trends on the Keeneland

Keeneland Partners

Page 4: An Analysis of GPU Utilization Trends on the Keeneland

KIDS Specifications

Node architecture: HP ProLiant SL390 G7
CPU: Intel Xeon X5660 (Westmere), 12 cores per node
Host memory per node: 24 GB
GPU architecture: NVIDIA Tesla M2090 (Fermi)
GPUs per node: 3
GPU memory per node: 18 GB (6 GB per GPU)
CPU:GPU ratio: 2:3
Interconnect: InfiniBand QDR (single rail)
Total number of nodes: 120
Total CPU cores: 1,440
Total GPU cores: 161,280

Page 5: An Analysis of GPU Utilization Trends on the Keeneland

Need for a monitoring tool

• Most applications did not have the appropriate administrative tools and vendor support
• GPU administration has largely been an afterthought, as vendors in this space are focused on gaming and video applications
• There is a compelling need to monitor GPU utilization on Keeneland for the purposes of proper system administration and future planning for the Keeneland Final System

Page 6: An Analysis of GPU Utilization Trends on the Keeneland

Design of the monitoring tool

• In CUDA 4.1, NVIDIA provided enhanced functionality for the NVIDIA System Management Interface (nvidia-smi)
• NVML - NVIDIA Management Library
  – C-based API for monitoring and managing various states of NVIDIA GPU devices
  – Provides direct access to the queries and commands exposed via nvidia-smi
  – Data is presented in plain text or XML format (a short sketch using the Python binding follows)
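The slides describe the C API; as an illustration only, a minimal sketch of the same utilization query through NVML's Python binding (the nvidia-ml-py / pynvml package, which the Ganglia comparison later in the deck mentions) might look like the following. The Keeneland tool itself parses nvidia-smi text output rather than calling NVML directly.

    import pynvml

    # Query per-GPU utilization through NVML's Python binding (illustrative only;
    # the production tool on KIDS parses nvidia-smi text output instead).
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages over the last sample period
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            print("GPU %d: gpu=%d%%, mem_util=%d%%, mem_used=%d MiB"
                  % (i, util.gpu, util.memory, mem.used // (1024 * 1024)))
    finally:
        pynvml.nvmlShutdown()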

Page 7: An Analysis of GPU Utilization Trends on the Keeneland

Sample output of nvidia-smi -q -d utilization
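The output itself was shown as a screenshot on the original slide. As a rough schematic only (exact fields and layout vary by driver version), the utilization section reports, per attached GPU, a GPU utilization percentage and a memory utilization percentage in plain text of approximately this shape:

    GPU <PCI bus ID>
        Utilization
            Gpu     : NN %
            Memory  : NN %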

Page 8: An Analysis of GPU Utilization Trends on the Keeneland

Design of monitoring tool

[Diagram: on each compute node (1 through 60), a Bash script writes nvidia-smi output to a temporary file, a Python script parses the file, and the results are loaded into a central database. A sketch of one collection cycle follows.]
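A minimal sketch of one collection cycle, collapsed into a single Python script for brevity. The file paths, the SQLite backend, the table schema, and the parsing of the "Gpu"/"Memory" lines are illustrative assumptions, not the production tool.

    import socket
    import sqlite3
    import subprocess
    import time

    TMP_FILE = "/tmp/gpu_util.txt"         # hypothetical temp-file location
    DB_FILE = "/var/lib/gpu_util/gpu.db"   # hypothetical database backend

    def collect():
        # "Bash script" step: dump the utilization report to a temporary file.
        with open(TMP_FILE, "w") as f:
            subprocess.call(["nvidia-smi", "-q", "-d", "UTILIZATION"], stdout=f)

    def parse_and_store():
        # "Python script" step: pull the per-GPU "Gpu : NN %" / "Memory : NN %"
        # lines out of the temporary file and append them to the database.
        gpu_util, mem_util = [], []
        with open(TMP_FILE) as f:
            for line in f:
                key, _, value = line.partition(":")
                key, value = key.strip(), value.strip()
                if key == "Gpu" and value.endswith("%"):
                    gpu_util.append(int(value.rstrip(" %")))
                elif key == "Memory" and value.endswith("%"):
                    mem_util.append(int(value.rstrip(" %")))
        conn = sqlite3.connect(DB_FILE)
        conn.execute("CREATE TABLE IF NOT EXISTS gpu_util "
                     "(ts INTEGER, node TEXT, gpu INTEGER, gpu_pct INTEGER, mem_pct INTEGER)")
        now, node = int(time.time()), socket.gethostname()
        for idx, (g, m) in enumerate(zip(gpu_util, mem_util)):
            conn.execute("INSERT INTO gpu_util VALUES (?, ?, ?, ?, ?)",
                         (now, node, idx, g, m))
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        collect()
        parse_and_store()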

Page 9: An Analysis of GPU Utilization Trends on the Keeneland

Design of monitoring tool

• If the script throws an exception, an email is sent to the system administrators
• The script is run by cron on 60 service nodes on Keeneland (an example cron entry follows)
  – Runs every 30 minutes
  – Every run produces 8 KB of data
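For illustration, the cron schedule implied above could be expressed as the following crontab entry (the script path is a placeholder, not the actual deployment):

    # Collect GPU utilization every 30 minutes
    */30 * * * * /opt/keeneland/gpu-monitor/collect_gpu_util.sh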

Page 10: An Analysis of GPU Utilization Trends on the Keeneland

Analysis of Data

• CPU Utilization and Overall GPU Utilization
  – Total GPU utilization, when compared to CPU utilization, is relatively low

Page 11: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Overall GPU Utilization

• Possible reasons for low utilization of GPUs:
  – An application's ability to fully utilize all GPUs in a multi-GPU environment
  – Limited bandwidth per FLOP available out of a single compute node
  – An application's ability to fully utilize the performance of a single GPU

Page 12: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Overall GPU Utilization – Caveats

• KIDS is a developmental system, so it is difficult to assert whether low utilization is due to deficiencies in an application or the user artificially limiting GPU usage during testing or debugging
• Further development of the toolset is intended to give more granular data, allowing more accurate conclusions

Page 13: An Analysis of GPU Utilization Trends on the Keeneland

Overall GPU Utilization by Application

[Chart: GPU and Memory Utilization by Software Package – Percentage Utilization (0-100%) per software package, with series for Average GPU Utilization and Average Memory Utilization]

• Several applications have GPU utilizations > 50% on average
• Memory utilization is significantly lower than GPU utilization
• It is unclear whether this is due to bandwidth constraints, application design, or other factors

Page 14: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Requested GPU Utilization

[Chart: CPU Utilization vs Requested GPU Utilization – Percentage Utilization (0-100%) over the timeline, with series for CPU Utilization and Utilization of Requested GPUs]

• Applications that do request GPUs make reasonable utilization of them

Page 15: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Requested GPU Utilization

[Charts: CPU Utilization vs Overall GPU Utilization and CPU Utilization vs Requested GPU Utilization – Percentage Utilization (0-100%) over the timeline, with series for CPU Utilization, Overall GPU Utilization, and Utilization of Requested GPUs]

• Possible reasons for this significant difference:
  – The user could be limiting the scope of the application for testing and debugging
  – Applications cannot adequately scale past one GPU per process due to limitations in the code or the limited inter-node bandwidth

Page 16: An Analysis of GPU Utilization Trends on the Keeneland

Number of jobs and number of GPUs requested per job

[Charts: Number of Jobs vs Number of GPUs/Job (> 3 GPUs) and Number of Jobs vs Number of GPUs/Job (Overall) – number of jobs plotted against the number of GPUs requested per job]

• The majority of jobs on KIDS request fewer than 3 GPUs
  – Due to the large number of very small, short jobs being used for application development
  – Once the system is in production, this number should drastically increase

Page 17: An Analysis of GPU Utilization Trends on the Keeneland

Issues encountered during development of toolkit

• A large volume of data is generated by the output of the nvidia-smi utility
  – Future versions of NVML should allow administrators to select only relevant data
• The failure mode of the nvidia-smi tool is unpredictable when there is a potentially faulty GPU in the system
  – The tool sometimes generates erratic output, no output, or seemingly normal output
  – This makes diagnosing problem GPUs on a large scale difficult

Page 18: An Analysis of GPU Utilization Trends on the Keeneland

Other monitoring tools

HP Insight Cluster Management Utility
• Provides CLI & GUI interfaces
• A tool that can be used for management, provisioning, and monitoring of hybrid HP systems

Ganglia's Gmond Python module
• Uses the Python binding for NVML
• Allows simplified access to GPU metrics such as temperature, memory usage, and utilization

Page 19: An Analysis of GPU Utilization Trends on the Keeneland

Comparison with other tools

• Gmond presents data in RRD format, which is an abbreviated, averaged version of the data
• Extremely high level, which is not useful if you are trying to understand utilization at a particular moment in time
• Our tool collects data over time and does not average it
• Allows us to maintain granularity much farther into the future
• Useful in scenarios where you can correlate GPU usage with ECC errors
• Easy to get statistics on utilization with respect to job sizes, wall-clock requests, GPU requests, etc. (see the query sketch below)

Other considerations:
• Commercial tools were expensive
• Open-source alternatives were early in production and development
• We had a pressing need to provide very specific data to our review panel
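As an illustration of that last point, a query of the kind the raw data makes easy, written against the hypothetical SQLite schema sketched earlier (correlating with job sizes or wall-clock requests would additionally require joining against batch-system accounting records, which are not shown):

    import sqlite3

    # Average GPU and memory utilization per node across everything collected so far
    # (gpu_util is the hypothetical table from the earlier collection sketch).
    conn = sqlite3.connect("/var/lib/gpu_util/gpu.db")   # hypothetical path, as before
    query = ("SELECT node, AVG(gpu_pct), AVG(mem_pct) FROM gpu_util "
             "GROUP BY node ORDER BY AVG(gpu_pct) DESC")
    for node, avg_gpu, avg_mem in conn.execute(query):
        print("%s: avg GPU util %.1f%%, avg memory util %.1f%%" % (node, avg_gpu, avg_mem))
    conn.close()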

Page 20: An Analysis of GPU Utilization Trends on the Keeneland

Conclusions

• This tool provides an important first step in creating an open-source tool for the collection of utilization statistics for GPU-based systems
• Not many monitoring tools are available for GPU systems, and the few that are, are expensive or in early development
• A high-level study of the data reveals that most software, barring a few packages, is still CPU-cycle heavy and does not take full advantage of the processing power of the GPUs

Page 21: An Analysis of GPU Utilization Trends on the Keeneland

Future Work

• Collection of other statistics such as ECC errors, power, and temperature
• Collection of statistics on a more frequent basis
• Collaborate with software developers to mine the data generated by this tool
  – Data can be used to aid software development for GPU systems
  – Data can also be used to determine appropriate CPU:GPU ratios for jobs and to assist in creating scheduling policies

Page 22: An Analysis of GPU Utilization Trends on the Keeneland

Questions