TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks
Bilge Acun, PhD Candidate
Department of Computer Science
University of Illinois at Urbana-Champaign
*Contributors: Nikhil Jain, Abhinav Bhatele, Misbah Mubarak, Christopher D. Carothers, and Laxmikant V. Kale



Network Simulation
• Motivation:
  • Design of future supercomputers
    • Node architecture
    • Interconnection network
  • Predict application performance
    • On existing and non-existing architectures
• State of the art:
  • Discrete-event-based simulation
    • Not parallel or scalable
    • Large memory footprints
  • Cannot simulate real HPC workloads
    • Synthetic communication patterns
    • Skeletonized codes


TraceR: Trace Replay
• A trace-driven simulator
  • Optimistic parallel discrete-event simulation (PDES)
  • For real HPC traffic workloads
• Outperforms state-of-the-art simulators
  • BigNet-Sim, SST
• Scalable
  • Simulates execution on half a million nodes in under 10 minutes using 512 cores
• Optimistic simulation parameter study
  • Maximizes performance for simulating real HPC traffic workloads


TraceR Components
• Input:
  • Application Traces from BigSim (AMPI & Charm++)
  • Network Configuration (dimensions, bandwidth, packet size, …)
  • PDES parameters
• TraceR, built on:
  • CODES: accurate packet-level network models (Torus, Dragonfly, ...)
  • ROSS: PDES framework
• Output:
  • Performance Prediction


BigSim Simulator
• One of the earliest packet-level HPC network simulators
  • Around 2004
• Emulation framework
  • Can generate traces using far fewer cores than the actual run
• Built on the POSE PDES framework
  • Cause of the slow performance
  • Poor scaling


BigSim Trace Format
• Entry for each Sequential Execution Block (SEB)

Time Stamp, Task ID, Name, Duration, …, Msg ID, Source Node, …, Back & Forward Dep.
-1.000000 47 AMPI_Bcast--time:5960 0.000006 ... $B 46 $F 53
0.001148 48 start-broadcast--time:0 0.000000 ... $B $F 49
-1.000000 49 AMPI_generic--time:3099 0.000003 .. $B 48 $F 50 52
-1.000000 50 end-broadcast--time:0 0.000000 ... $B 49 $F
0.001151 51 msgep--time:953 0.000001 … $B $F
0.001154 52 RECV_RESUME--time:953 0.000001 ... $B 49 $F 53
-1.000000 60 user_code--time:0 0.000000 ... $B 59 54 $F 61
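The trace entries above can be read as a small dependency graph: each SEB lists the tasks it depends on after $B and the tasks that depend on it after $F. The following is a minimal Python sketch of a parser for that shape. It assumes whitespace-separated fields in the order shown on the slide, ignores the fields the slide elides between the duration and $B, and uses hypothetical names (TraceEntry, parse_entry) that are not part of BigSim or TraceR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceEntry:
    timestamp: float      # -1.000000 appears on the slide for some entries
    task_id: int
    name: str
    duration: float
    backward: List[int]   # task IDs listed after $B
    forward: List[int]    # task IDs listed after $F


def parse_entry(line: str) -> TraceEntry:
    """Parse one entry; fields between the duration and $B are ignored (assumption)."""
    head, _, deps = line.partition("$B")
    back_part, _, fwd_part = deps.partition("$F")
    fields = head.split()
    return TraceEntry(
        timestamp=float(fields[0]),
        task_id=int(fields[1]),
        name=fields[2],
        duration=float(fields[3]),
        backward=[int(tok) for tok in back_part.split()],
        forward=[int(tok) for tok in fwd_part.split()],
    )


entry = parse_entry("-1.000000 49 AMPI_generic--time:3099 0.000003 $B 48 $F 50 52")
print(entry.task_id, entry.backward, entry.forward)
```

Keeping the two dependency lists separate makes it straightforward to check when a task becomes runnable: all of its backward dependencies must have completed.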


Definitions and Evaluation Metrics
Definitions:
• PE: simulated process; the logical process (LP) visible to ROSS
• Task: sequential execution block (SEB)
• Event: represents an action with a time-stamp in the PDES
  • Kickoff Event, Message Recv Event, Completion Event
• Reverse Handler: responsible for reversing the effect of an event
Metrics:
• Execution time: time spent performing the simulation
• Event rate: number of events executed per second (excl. rollbacks)
• Event efficiency (or rollback efficiency)
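The two rate-style metrics can be made concrete with a short Python sketch. The formula for event efficiency is not present in the extracted slide text, so the version below is an assumption: it uses one minus the ratio of rolled-back events to net (committed) events, a definition commonly reported by optimistic PDES engines. The function names and the sample numbers are hypothetical.

```python
def event_rate(net_events: int, seconds: float) -> float:
    """Committed events per second; rolled-back events are excluded."""
    return net_events / seconds


def event_efficiency(net_events: int, rolled_back: int) -> float:
    """Assumed definition: 1 - rolled_back / net_events."""
    return 1.0 - rolled_back / net_events


# Hypothetical run: 1,000,000 committed events in 2 s, 250,000 events rolled back.
print(event_rate(1_000_000, 2.0))            # 500000.0
print(event_efficiency(1_000_000, 250_000))  # 0.75
```

Under this assumed definition the efficiency is 1.0 when nothing is rolled back and becomes negative when more events are rolled back than committed.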


TraceR: Execution flow
[Flow diagram. Labels: First task; Execute Task; Send message to other PEs; Schedule completion event; Receive message from other PEs; Completion Event; Message Recv Event; Remote Message. Legend: ROSS events, TraceR functions.]
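The flow in the diagram can be illustrated with a small sequential event loop in Python. This is only a sketch under stated assumptions, not TraceR's implementation: it runs on a single heap rather than on ROSS, it is not optimistic, messages are assumed to leave when the sending task finishes, and the fixed LATENCY stands in for the packet-level network model. The names Task, replay, and LATENCY are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

LATENCY = 1.0  # hypothetical fixed network delay replacing a real network model


@dataclass
class Task:
    duration: float
    sends: List[Tuple[int, int]] = field(default_factory=list)  # (dest PE, dest task index)
    waits_for_msg: bool = False


def replay(pes: Dict[int, List[Task]]) -> Dict[int, float]:
    """Replay each PE's tasks in order; return the time of its last completion."""
    events: List[Tuple[float, int, str, int, int]] = []
    counter = 0                      # tie-breaker so heap entries never compare equal
    next_task = {pe: 0 for pe in pes}
    busy = {pe: False for pe in pes}
    arrived = set()                  # (pe, task index) pairs whose message has arrived
    finish = {pe: 0.0 for pe in pes}

    def schedule(time: float, kind: str, pe: int, idx: int) -> None:
        nonlocal counter
        heapq.heappush(events, (time, counter, kind, pe, idx))
        counter += 1

    def try_execute(pe: int, now: float) -> None:
        idx = next_task[pe]
        if busy[pe] or idx >= len(pes[pe]):
            return
        task = pes[pe][idx]
        if task.waits_for_msg and (pe, idx) not in arrived:
            return
        busy[pe] = True
        done = now + task.duration
        for dest_pe, dest_idx in task.sends:      # send messages to other PEs
            schedule(done + LATENCY, "recv", dest_pe, dest_idx)
        schedule(done, "completion", pe, idx)     # schedule completion event

    for pe in pes:
        schedule(0.0, "kickoff", pe, 0)           # kickoff event starts the first task

    while events:
        now, _, kind, pe, idx = heapq.heappop(events)
        if kind == "recv":
            arrived.add((pe, idx))
        elif kind == "completion":
            busy[pe] = False
            next_task[pe] = idx + 1
            finish[pe] = now
        try_execute(pe, now)
    return finish


result = replay({
    0: [Task(2.0, sends=[(1, 0)])],
    1: [Task(3.0, waits_for_msg=True)],
})
print(result)  # {0: 2.0, 1: 6.0}
```

In the usage example, PE 0 finishes at time 2.0, its message reaches PE 1 at 3.0, and PE 1's three-unit task completes at 6.0.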


Experimental Results
• Scaling experiments were run on Blue Waters at UIUC
• Prediction study experiments were run on Vulcan at LLNL
• Applications:
  • 3D Stencil:
    • AMPI application
    • 7-point Jacobi relaxation on a 3D grid
    • 128 x 128 x 128 grid points per MPI process -> 128 KB msgs
  • LeanMD:
    • Charm++ application
    • Mini-app version of the NAMD molecular dynamics simulation
    • Mimics the short-range force calculations of NAMD
    • 1.2 million atoms
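The 128 KB message size quoted for 3D Stencil is consistent with each process exchanging one face of its 128 x 128 x 128 block per neighbor. The short Python check below works through that arithmetic. It assumes 8-byte (double-precision) grid values and one message per face; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def face_message_bytes(nx: int, ny: int, bytes_per_point: int = 8) -> int:
    """Size of one face of the local block, assuming 8-byte grid values."""
    return nx * ny * bytes_per_point


size = face_message_bytes(128, 128)
print(size, size // 1024)  # 131072 128
```

A 7-point stencil has six face neighbors in 3D, so under these assumptions each process would send six such messages per iteration.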

 


Sequential Comparison of Simulators
[Figure: sequential comparison of simulators. Annotation: skeletonized MPI code.]


Conservative vs. Optimistic
[Figure: conservative vs. optimistic simulation.]
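An optimistic simulation processes events without waiting to confirm they are safe, and uses the reverse handlers defined earlier to undo work when an event with an earlier time-stamp arrives late. The Python sketch below illustrates that idea for a single PE. It is an illustration under stated assumptions, not ROSS's API: PEState, CompletionEvent, forward, reverse, and process_optimistically are hypothetical names, and state saving is reduced to a single saved clock value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PEState:
    tasks_done: int = 0
    clock: float = 0.0


@dataclass
class CompletionEvent:
    timestamp: float
    saved_clock: float = 0.0  # filled in by the forward handler for later rollback


def forward(state: PEState, ev: CompletionEvent) -> None:
    """Apply the event and remember what is needed to undo it."""
    ev.saved_clock = state.clock
    state.clock = ev.timestamp
    state.tasks_done += 1


def reverse(state: PEState, ev: CompletionEvent) -> None:
    """Reverse handler: restore the state from before forward() ran."""
    state.tasks_done -= 1
    state.clock = ev.saved_clock


def process_optimistically(state: PEState,
                           processed: List[CompletionEvent],
                           ev: CompletionEvent) -> int:
    """Process ev; roll back and re-run any processed events with a later time-stamp."""
    undone: List[CompletionEvent] = []
    while processed and processed[-1].timestamp > ev.timestamp:
        late = processed.pop()
        reverse(state, late)
        undone.append(late)
    for e in [ev] + undone[::-1]:
        forward(state, e)
        processed.append(e)
    return len(undone)  # number of events rolled back


state, history = PEState(), []
for ts in (1.0, 3.0, 2.0):  # the event at 2.0 arrives out of order
    rolled_back = process_optimistically(state, history, CompletionEvent(ts))
print(rolled_back, state.tasks_done, state.clock)  # 1 3 3.0
```

A conservative simulation would instead delay the event at 3.0 until it was certain that no earlier event could still arrive, which avoids rollbacks at the cost of extra synchronization.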

TraceR Scaling w/ AMPI app.
[Figure: TraceR scaling with the AMPI application.]

TraceR Scaling w/ AMPI app.
[Figure: TraceR scaling with the AMPI application, second plot.]


Event Efficiency
[Figure: event efficiency.]


Trace Reading Time
[Figure: trace reading time.]
• Insignificant overhead with increasing number of cores!

TraceR Performance Prediction w/ Charm++ app.
[Figure: TraceR performance prediction with the Charm++ application.]


Event Rate: million events/s
[Figure: event rate in million events/s.]


Efficiency
[Figure: efficiency.]


Ongoing Work and Summary
• Ongoing & future work:
  • Fat-tree network model
    • Integrated into CODES
  • Multiple job simulations
    • Effect of multiple jobs in the network
    • More realistic scenario
  • Switch to Charm++ based ROSS from MPI based ROSS
• TraceR feature highlights:
  • A parallel, trace-driven, scalable network simulator
  • Support for various topologies: Torus, Dragonfly, Fat-tree
  • Simulate AMPI, Charm++ applications
  • Can simulate half a million nodes in minutes


Thank you!
• Paper in progress:
  Bilge Acun, Nikhil Jain, Abhinav Bhatele, Misbah Mubarak, Christopher D. Carothers, and Laxmikant V. Kale. TraceR: A Parallel Trace Replay Tool for Studying Interconnection Networks
• TraceR source code:
  • http://charm.cs.uiuc.edu/gerrit/#/admin/projects/tracer

 

TraceR Scaling w/ Charm++ app.
[Figure: TraceR scaling with the Charm++ application.]