GPTL: A simple and free general purpose tool for performance analysis and profiling
April 8, 2014
Jim Rosinski
NOAA/ESRL
NCAR SEA 2
Outline
• Motivation and Basic Usage
• Auto-instrumentation
• Auto-profiling MPI routines
• Summary across threads and tasks
• Induced overhead
• Choice of underlying timing routine
• PAPI interface
• Utility functions
• Future work
Motivation
• Needed something to simplify the following, for an arbitrary number of regions to be timed:
    time = 0;
    for (i = 0; i < 10; i++) {
      gettimeofday (&tp1, 0);
      compute ();
      gettimeofday (&tp2, 0);
      // tv_usec is in microseconds, so scale by 1.e-6 to get seconds
      delta = tp2.tv_sec - tp1.tv_sec + 1.e-6*(tp2.tv_usec - tp1.tv_usec);
      time += delta;
    }
    printf ("compute took %g seconds\n", time);
Solution
    #include <gptl.h>
    ...
    ret = GPTLinitialize ();
    ret = GPTLstart ("total");
    for (i = 0; i < 10; i++) {
      ret = GPTLstart ("compute");
      compute ();
      ret = GPTLstop ("compute");
      ...
    }
    ret = GPTLstop ("total");
    ret = GPTLpr (0);
Results
• Output file timing.0 contains:
              Called  Wallclock
    total          1      3.983
    compute       10      3.877
Most of the API
    #include <gptl.h>
    ...
    ret = GPTLsetoption (PAPI_FP_OPS, 1);    // Enable a PAPI counter
    ret = GPTLsetutr (GPTLnanotime);         // Better wallclock timer
    ...
    ret = GPTLinitialize ();                 // Once per process
    ret = GPTLstart ("total");               // Start a timer
    ret = GPTLstart ("compute");             // Start another timer
    compute ();                              // Do work
    ret = GPTLstop ("compute");              // Stop a timer
    ...
    ret = GPTLstop ("total");                // Stop a timer
    ret = GPTLpr (iam);                      // Print results
    ret = GPTLpr_summary (MPI_COMM_WORLD);   // Print results summary
                                             // across threads and tasks
Set options via Fortran namelist
• Avoid recoding/recompiling by using Fortran namelist option:
    call gptlprocess_namelist ('my_namelist', unitno, ret)
• Example contents of 'my_namelist':
    &gptlnl
     utr = 'nanotime'
     eventlist = 'GPTL_CI','PAPI_FP_OPS'
/
Auto-instrumentation
• Works with Intel, GNU, Pathscale, PGI, AIX
    # icc -g -finstrument-functions *.c -lgptl
    # gfortran -g -finstrument-functions *.f90 -lgptl
    # pgcc -g -Minstrument:functions *.c -lgptl
• Inserts automatically at function start:
    __cyg_profile_func_enter (void *this_fn, void *call_site);
• And at function exit:
    __cyg_profile_func_exit (void *this_fn, void *call_site);
Auto-instrumentation (cont’d)
• GPTL handles these entry points with:
    void __cyg_profile_func_enter (void *this_fn, void *call_site)
    {
      (void) GPTLstart_instr (this_fn);
    }
    void __cyg_profile_func_exit (void *this_fn, void *call_site)
    {
      (void) GPTLstop_instr (this_fn);
    }
Auto-instrumentation (cont’d)
• After running the app, convert addresses to names with:
hex2name.pl [-demangle] <executable> <timing_file>
Dynamic call tree from auto-instrumentation
Stats for thread 0:
                        Called  Wallclock       max       min    FP_OPS
  total                      1     64.021    64.021    64.021  3.50e+08
  HPCC_Init                 11      0.157     0.157     0.000     95799
* HPL_pdinfo               120      0.019     0.018     0.000     96996
* HPL_all_reduce             7      0.043     0.036     0.000       448
* HPL_broadcast             21      0.041     0.036     0.000       126
  HPL_pdlamch                2      0.004     0.004     0.000     94248
* HPL_fprintf              240      0.001     0.000     0.000      1200
  HPCC_InputFileInit        41      0.001     0.001     0.000       194
  ReadInts                   2      0.000     0.000     0.000        12
  PTRANS                    21     22.667    22.667     0.000  4.19e+07
  MaxMem                     5      0.000     0.000     0.000       796
* iceil_                   132      0.000     0.000     0.000       792
* ilcm_                     14      0.000     0.000     0.000        84
  param_dump                18      0.000     0.000     0.000        84
  Cblacs_get                 5      0.000     0.000     0.000        30
  Cblacs_gridmap            35      0.005     0.001     0.000       225
* Cblacs_pinfo               7      0.000     0.000     0.000        40
* Cblacs_gridinfo           60      0.000     0.000     0.000       260
MPI Auto-instrumentation
• To enable MPI auto-instrumentation, in macros.make set this:
– ENABLE_PMPI=yes
MPI Auto-instrumentation (cont’d)
Stats for thread 0:
                          Called  Wallclock       max       min  AVG_MPI_BYTES
  MPI_Init_thru_Finalize       1   8.70e-04  8.70e-04  8.70e-04              -
  MPI_Send                     1   5.10e-05  5.10e-05  5.10e-05      4.096e+03
  MPI_Recv                     3   2.63e-04  2.32e-04  1.50e-05      4.096e+03
  MPI_Ssend                    1   2.40e-05  2.40e-05  2.40e-05      4.096e+03
  MPI_Issend                   1   1.00e-05  1.00e-05  1.00e-05      4.096e+03
  MPI_Sendrecv                 1   1.80e-05  1.80e-05  1.80e-05      8.192e+03
  MPI_Irecv                    2   1.00e-05  9.00e-06  1.00e-06      4.096e+03
  MPI_Isend                    2   6.00e-06  4.00e-06  2.00e-06      4.096e+03
  MPI_Wait                     2   1.80e-05  1.70e-05  1.00e-06              -
  MPI_Waitall                  2   1.10e-05  1.10e-05  0.00e+00              -
  MPI_Barrier                  1   2.20e-05  2.20e-05  2.20e-05              -
  MPI_Bcast                    1   9.00e-06  9.00e-06  9.00e-06      4.096e+03
Induced Overhead
• GPTL estimates its own overhead:
overhead of 1 GPTLstart or GPTLstop call=1.28e-07 seconds
Components are as follows:
Fortran layer:             1.0e-09 =  1.5% of total
Get thread number:         1.7e-08 = 13.3% of total
Generate hash index:       1.9e-08 = 14.8% of total
Find hashtable entry:      1.5e-08 = 11.7% of total
Underlying timing routine: 7.0e-08 = 53.2% of total
Misc start/stop functions: 7.0e-09 =  5.5% of total
Induced Overhead (cont’d)
Stats for thread 0:
              Called  Wallclock       max       min  self_OH  parent_OH
  total            1      0.910     0.910     0.910    0.000      0.000
  1x1e7            1      0.022     0.022     0.022    0.000      0.000
  10x1e6          10      0.015  1.55e-03  1.36e-03    0.000      0.000
  100x1e5        100      0.014  1.80e-04  1.11e-04    0.000      0.000
  1000x1e4      1000      0.015  2.01e-05  1.11e-05    0.000      0.000
  1e4x1000     10000      0.015  1.04e-05  1.12e-06    0.000      0.001
  1e5x100     100000      0.015  9.05e-06  1.22e-07    0.001      0.006
  1e6x10     1.0e+06      0.026  8.74e-06  1.67e-08    0.011      0.062
  1e7x1      1.0e+07      0.180  8.74e-06  1.11e-08    0.108      0.618
Underlying timing routine
• Default is gettimeofday()
• For Intel architectures, change to a register read, which has better granularity and much lower overhead:
– C or Fortran: GPTLsetutr(GPTLnanotime);
– Fortran: utr = 'nanotime' in namelist &gptlnl
– May cause problems on machines with variable clock rate (e.g. “turbo mode”)
PAPI details handled by GPTL
• This call:
    GPTLsetoption (PAPI_FP_OPS, 1);
• Implies:
    PAPI_library_init (PAPI_VER_CURRENT);
    PAPI_thread_init ((unsigned long (*)(void)) pthread_self);
    PAPI_create_eventset (&EventSet[t]);
    PAPI_assign_eventset_component (EventSet[t], 0);
    PAPI_multiplex_init ();
    PAPI_set_multiplex (EventSet[t]);
    PAPI_add_event (EventSet[t], PAPI_FP_OPS);
    PAPI_start (EventSet[t]);
• PAPI multiplexing handled automatically, enabled only if needed
timing.summary file generated by GPTLpr_summary(comm)
name            ncalls  nranks  mean_time  std_dev  wallmax (rank )  wallmin (rank )
Diag              1002       2      4.371    3.453    6.812 (    0)    1.929 (    1)
MainLoop             2       2     53.364    0.007   53.369 (    0)   53.359 (    1)
ZeroTendencies     200       2      0.086    0.030    0.107 (    0)    0.065 (    1)
SaveFlux           200       2      0.149    0.048    0.183 (    0)    0.115 (    1)
RHStendencies      800       2      0.421    0.148    0.526 (    0)    0.317 (    1)
Vdtotal           1600       2     25.702    1.361   26.665 (    0)   24.740 (    1)
Vdm                800       2     23.851    1.118   24.642 (    0)   23.060 (    1)
vdmfinish          800       2      2.794    1.010    3.508 (    0)    2.080 (    1)
Vdn                800       2      1.848    0.246    2.022 (    0)    1.674 (    1)
Flux               800       2      4.818    1.135    5.620 (    1)    4.015 (    0)
Force              800       2      1.901    0.110    1.979 (    1)    1.823 (    0)
RKdiff             800       2      1.247    0.415    1.540 (    0)    0.953 (    1)
TimeDiff           800       2      0.736    0.182    0.865 (    0)    0.608 (    1)
Sponge             800       2      0.364    0.092    0.429 (    0)    0.299 (    1)
pre_trisol         200       2      0.112    0.027    0.131 (    0)    0.093 (    1)
Trisol             200       2      0.667    0.078    0.722 (    1)    0.612 (    0)
post_trisol        200       2      0.082    0.012    0.090 (    0)    0.073 (    1)
Vdmints            200       2      3.603    0.135    3.699 (    0)    3.508 (    1)
Pstadv             200       2      0.849    0.044    0.880 (    1)    0.817 (    0)
Utility functions
• To print current memory usage at any point in your code:
– ret = GPTLprint_memusage ("user string");
• Produces e.g.
– GPTLprint_memusage: user string size=19.5 MB rss=2.1 MB datastack=1.5 MB
• To auto-profile current memory usage (at both function entry and exit points):
– ret = GPTLsetoption (GPTLdopr_memusage, 1);
• Retrieve wallclock, usr, sys timestamps to user code:
– ret = GPTLstamp (&wallclock, &usr, &sys);
Future Work
• XML output
• Port to GPU
• Dynamic thread allocation for PTHREADS option
• Autoconf?
Source and Documentation
• Source: https://github.com/jmrosinski/GPTL
– git clone [email protected]:jmrosinski/GPTL.git
• Web-based documentation:
– jmrosinski.github.io/GPTL
• Feel free to email me: [email protected]