
Page 1: An Analysis of GPU Utilization Trends on the Keeneland

July 18, 2012

An Analysis of GPU Utilization Trends on the Keeneland Initial Delivery System Tabitha K Samuel, Stephen McNally, John Wynkoop National Institute for Computational Sciences

Page 2: An Analysis of GPU Utilization Trends on the Keeneland

The Keeneland Project

• 5-year Track 2D cooperative agreement awarded by the NSF
• Partners – Georgia Tech, Oak Ridge National Lab, National Institute for Computational Sciences, and the University of Tennessee
• The Keeneland Initial Delivery System (KIDS) is being used to develop programming tools and libraries for a GPGPU platform

Page 3: An Analysis of GPU Utilization Trends on the Keeneland

Keeneland Partners

Page 4: An Analysis of GPU Utilization Trends on the Keeneland

KIDS Specifications

Node architecture: HP ProLiant SL390 G7
CPU: Intel Xeon X5660 (Westmere), 12 cores per node
Host memory per node: 24 GB
GPU architecture: NVIDIA Tesla M2090 (Fermi)
GPUs per node: 3
GPU memory per node: 18 GB (6 GB per GPU)
CPU:GPU ratio: 2:3
Interconnect: InfiniBand QDR (single rail)
Total number of nodes: 120
Total CPU cores: 1,440
Total GPU cores: 161,280

Page 5: An Analysis of GPU Utilization Trends on the Keeneland

Need for a monitoring tool

• Most applications did not have the appropriate administrative tools and vendor support
• GPU administration has largely been an afterthought, as vendors in this space are focused on gaming and video applications
• There is a compelling need to monitor GPU utilization on Keeneland for the purposes of proper system administration and future planning for the Keeneland Final System

Page 6: An Analysis of GPU Utilization Trends on the Keeneland

Design of the monitoring tool

• In CUDA 4.1, NVIDIA provided enhanced functionality for the NVIDIA System Management Interface (nvidia-smi)
• NVML - NVIDIA Management Library
  – C-based API for monitoring and managing various states of NVIDIA GPU devices
  – Provides direct access to the queries and commands exposed via nvidia-smi
  – Data is presented in plain text or XML format (a short sketch using the Python binding follows)
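The slides describe the C API; as an illustration only, a minimal sketch of the same utilization query through NVML's Python binding (the nvidia-ml-py / pynvml package, which the Ganglia comparison later in the deck mentions) might look like the following. The Keeneland tool itself parses nvidia-smi text output rather than calling NVML directly.

    import pynvml

    # Query per-GPU utilization through NVML's Python binding (illustrative only;
    # the production tool on KIDS parses nvidia-smi text output instead).
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages over the last sample period
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            print("GPU %d: gpu=%d%%, mem_util=%d%%, mem_used=%d MiB"
                  % (i, util.gpu, util.memory, mem.used // (1024 * 1024)))
    finally:
        pynvml.nvmlShutdown()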

Page 7: An Analysis of GPU Utilization Trends on the Keeneland

Sample output of nvidia-smi -q -d utilization
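The output itself was shown as a screenshot on the original slide. As a rough schematic only (exact fields and layout vary by driver version), the utilization section reports, per attached GPU, a GPU utilization percentage and a memory utilization percentage in plain text of approximately this shape:

    GPU <PCI bus ID>
        Utilization
            Gpu     : NN %
            Memory  : NN %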

Page 8: An Analysis of GPU Utilization Trends on the Keeneland

Design of monitoring tool

[Diagram: on each compute node (1 through 60), a Bash script writes nvidia-smi output to a temporary file, a Python script parses the file, and the results are loaded into a central database. A sketch of one collection cycle follows.]
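A minimal sketch of one collection cycle, collapsed into a single Python script for brevity. The file paths, the SQLite backend, the table schema, and the parsing of the "Gpu"/"Memory" lines are illustrative assumptions, not the production tool.

    import socket
    import sqlite3
    import subprocess
    import time

    TMP_FILE = "/tmp/gpu_util.txt"         # hypothetical temp-file location
    DB_FILE = "/var/lib/gpu_util/gpu.db"   # hypothetical database backend

    def collect():
        # "Bash script" step: dump the utilization report to a temporary file.
        with open(TMP_FILE, "w") as f:
            subprocess.call(["nvidia-smi", "-q", "-d", "UTILIZATION"], stdout=f)

    def parse_and_store():
        # "Python script" step: pull the per-GPU "Gpu : NN %" / "Memory : NN %"
        # lines out of the temporary file and append them to the database.
        gpu_util, mem_util = [], []
        with open(TMP_FILE) as f:
            for line in f:
                key, _, value = line.partition(":")
                key, value = key.strip(), value.strip()
                if key == "Gpu" and value.endswith("%"):
                    gpu_util.append(int(value.rstrip(" %")))
                elif key == "Memory" and value.endswith("%"):
                    mem_util.append(int(value.rstrip(" %")))
        conn = sqlite3.connect(DB_FILE)
        conn.execute("CREATE TABLE IF NOT EXISTS gpu_util "
                     "(ts INTEGER, node TEXT, gpu INTEGER, gpu_pct INTEGER, mem_pct INTEGER)")
        now, node = int(time.time()), socket.gethostname()
        for idx, (g, m) in enumerate(zip(gpu_util, mem_util)):
            conn.execute("INSERT INTO gpu_util VALUES (?, ?, ?, ?, ?)",
                         (now, node, idx, g, m))
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        collect()
        parse_and_store()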

Page 9: An Analysis of GPU Utilization Trends on the Keeneland

Design of monitoring tool

• If the script throws an exception, an email is sent to the system administrators
• The script is run by cron on 60 service nodes on Keeneland (an example cron entry follows)
  – Runs every 30 minutes
  – Every run produces 8 KB of data
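For illustration, the cron schedule implied above could be expressed as the following crontab entry (the script path is a placeholder, not the actual deployment):

    # Collect GPU utilization every 30 minutes
    */30 * * * * /opt/keeneland/gpu-monitor/collect_gpu_util.sh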

Page 10: An Analysis of GPU Utilization Trends on the Keeneland

Analysis of Data

• CPU Utilization and Overall GPU Utilization
  – Total GPU utilization, when compared to CPU utilization, is relatively low

Page 11: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Overall GPU Utilization

• Possible reasons for low utilization of GPUs:
  – An application's ability to fully utilize all GPUs in a multi-GPU environment
  – Limited bandwidth per FLOP available out of a single compute node
  – An application's ability to fully utilize the performance of a single GPU

Page 12: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Overall GPU Utilization – Caveats

• KIDS is a developmental system, so it is difficult to assert whether low utilization is due to deficiencies in an application or the user artificially limiting GPU usage during testing or debugging
• Further development of the toolset is intended to give more granular data, allowing more accurate conclusions

Page 13: An Analysis of GPU Utilization Trends on the Keeneland

Overall GPU Utilization by Application

[Chart: GPU and Memory Utilization by Software Package – Percentage Utilization (0-100%) per software package, with series for Average GPU Utilization and Average Memory Utilization]

• Several applications have GPU utilizations > 50% on average
• Memory utilization is significantly lower than GPU utilization
• It is unclear whether this is due to bandwidth constraints, application design, or other factors

Page 14: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Requested GPU Utilization

[Chart: CPU Utilization vs Requested GPU Utilization – Percentage Utilization (0-100%) over the timeline, with series for CPU Utilization and Utilization of Requested GPUs]

• Applications that do request GPUs make reasonable utilization of them

Page 15: An Analysis of GPU Utilization Trends on the Keeneland

CPU Utilization and Requested GPU Utilization

[Charts: CPU Utilization vs Overall GPU Utilization and CPU Utilization vs Requested GPU Utilization – Percentage Utilization (0-100%) over the timeline, with series for CPU Utilization, Overall GPU Utilization, and Utilization of Requested GPUs]

• Possible reasons for this significant difference:
  – The user could be limiting the scope of the application for testing and debugging
  – Applications cannot adequately scale past one GPU per process due to limitations in the code or the limited inter-node bandwidth

Page 16: An Analysis of GPU Utilization Trends on the Keeneland

Number of jobs and number of GPUs requested per job

[Charts: Number of Jobs vs Number of GPUs/Job (> 3 GPUs) and Number of Jobs vs Number of GPUs/Job (Overall) – number of jobs plotted against the number of GPUs requested per job]

• The majority of jobs on KIDS request fewer than 3 GPUs
  – Due to the large number of very small, short jobs being used for application development
  – Once the system is in production, this number should drastically increase

Page 17: An Analysis of GPU Utilization Trends on the Keeneland

Issues encountered during development of toolkit

• A large volume of data is generated by the output of the nvidia-smi utility
  – Future versions of NVML should allow administrators to select only relevant data
• The failure mode of the nvidia-smi tool is unpredictable when there is a potentially faulty GPU in the system
  – The tool sometimes generates erratic output, no output, or seemingly normal output
  – This makes diagnosing problem GPUs on a large scale difficult

Page 18: An Analysis of GPU Utilization Trends on the Keeneland

Other monitoring tools

HP Insight Cluster Management Utility
• Provides CLI & GUI interfaces
• A tool that can be used for management, provisioning, and monitoring of hybrid HP systems

Ganglia's Gmond Python module
• Uses the Python binding for NVML
• Allows simplified access to GPU metrics such as temperature, memory usage, and utilization

Page 19: An Analysis of GPU Utilization Trends on the Keeneland

Comparison with other tools

• Gmond presents data in RRD format, which is an abbreviated, averaged version of the data
• Extremely high level, which is not useful if you are trying to understand utilization at a particular moment in time
• Our tool collects data over time and does not average it
• Allows us to maintain granularity much farther into the future
• Useful in scenarios where you can correlate GPU usage with ECC errors
• Easy to get statistics on utilization with respect to job sizes, wall-clock requests, GPU requests, etc. (see the query sketch below)

Other considerations:
• Commercial tools were expensive
• Open-source alternatives were early in production and development
• We had a pressing need to provide very specific data to our review panel
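As an illustration of that last point, a query of the kind the raw data makes easy, written against the hypothetical SQLite schema sketched earlier (correlating with job sizes or wall-clock requests would additionally require joining against batch-system accounting records, which are not shown):

    import sqlite3

    # Average GPU and memory utilization per node across everything collected so far
    # (gpu_util is the hypothetical table from the earlier collection sketch).
    conn = sqlite3.connect("/var/lib/gpu_util/gpu.db")   # hypothetical path, as before
    query = ("SELECT node, AVG(gpu_pct), AVG(mem_pct) FROM gpu_util "
             "GROUP BY node ORDER BY AVG(gpu_pct) DESC")
    for node, avg_gpu, avg_mem in conn.execute(query):
        print("%s: avg GPU util %.1f%%, avg memory util %.1f%%" % (node, avg_gpu, avg_mem))
    conn.close()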

Page 20: An Analysis of GPU Utilization Trends on the Keeneland

Conclusions

• This tool provides an important first step in creating an open-source tool for the collection of utilization statistics for GPU-based systems
• Not many monitoring tools are available for GPU systems, and the few that are, are expensive or in early development
• A high-level study of the data reveals that most software, barring a few packages, is still CPU-cycle heavy and does not take full advantage of the processing power of the GPUs

Page 21: An Analysis of GPU Utilization Trends on the Keeneland

Future Work

• Collection of other statistics such as ECC errors, power, and temperature
• Collection of statistics on a more frequent basis
• Collaborate with software developers to mine the data generated by this tool
  – Data can be used to aid software development for GPU systems
  – Data can also be used to determine appropriate CPU:GPU ratios for jobs and to assist in creating scheduling policies

Page 22: An Analysis of GPU Utilization Trends on the Keeneland

Questions