
GPTL: A simple and free general purpose tool for performance analysis and profiling

April 8, 2014

Jim Rosinski, NOAA/ESRL

NCAR SEA 2

Outline

• Motivation and Basic Usage
• Auto-instrumentation
• Auto-profiling MPI routines
• Summary across threads and tasks
• Induced overhead
• Choice of underlying timing routine
• PAPI interface
• Utility functions
• Future work


Motivation

• Needed something to simplify hand-coded timing like this, for an arbitrary number of regions to be timed:

    struct timeval tp1, tp2;
    double delta, time = 0;
    for (i = 0; i < 10; i++) {
      gettimeofday (&tp1, 0);
      compute ();
      gettimeofday (&tp2, 0);
      delta = (tp2.tv_sec - tp1.tv_sec) + 1.e-6*(tp2.tv_usec - tp1.tv_usec);
      time += delta;
    }
    printf ("compute took %g seconds\n", time);


Solution

    #include <gptl.h>
    ...
    ret = GPTLinitialize ();
    ret = GPTLstart ("total");
    for (i = 0; i < 10; i++) {
      ret = GPTLstart ("compute");
      compute ();
      ret = GPTLstop ("compute");
      ...
    }
    ret = GPTLstop ("total");
    ret = GPTLpr (0);


Results

• Output file timing.0 contains:

                 Called  Wallclock
    total             1      3.983
      compute        10      3.877


Most of the API

    #include <gptl.h>
    ...
    ret = GPTLsetoption (PAPI_FP_OPS, 1);    // Enable a PAPI counter
    ret = GPTLsetutr (GPTLnanotime);         // Better wallclock timer
    ...
    ret = GPTLinitialize ();                 // Once per process
    ret = GPTLstart ("total");               // Start a timer
    ret = GPTLstart ("compute");             // Start another timer
    compute ();                              // Do work
    ret = GPTLstop ("compute");              // Stop a timer
    ...
    ret = GPTLstop ("total");                // Stop a timer
    ret = GPTLpr (iam);                      // Print results
    ret = GPTLpr_summary (MPI_COMM_WORLD);   // Print results summary
                                             // across threads and tasks


Set options via Fortran namelist

• Avoid recoding/recompiling by using Fortran namelist option:

    call gptlprocess_namelist ('my_namelist', unitno, ret)

• Example contents of 'my_namelist':

    &gptlnl
     utr = 'nanotime'
     eventlist = 'GPTL_CI','PAPI_FP_OPS'
    /


Auto-instrumentation

• Works with Intel, GNU, Pathscale, PGI, AIX

    # icc -g -finstrument-functions *.c -lgptl
    # gfortran -g -finstrument-functions *.f90 -lgptl
    # pgcc -g -Minstrument:functions *.c -lgptl

• Inserts automatically at function start:

    __cyg_profile_func_enter (void *this_fn, void *call_site);

• And at function exit:

    __cyg_profile_func_exit (void *this_fn, void *call_site);


Auto-instrumentation (cont’d)

• GPTL handles these entry points with:

    void __cyg_profile_func_enter (void *this_fn, void *call_site)
    {
      (void) GPTLstart_instr (this_fn);
    }

    void __cyg_profile_func_exit (void *this_fn, void *call_site)
    {
      (void) GPTLstop_instr (this_fn);
    }


Auto-instrumentation (cont’d)

• After running the app, convert addresses to names with:

hex2name.pl [-demangle] <executable> <timing_file>


Dynamic call tree from auto-instrumentation

Stats for thread 0:
                        Called  Wallclock      max      min    FP_OPS
  total                      1     64.021   64.021   64.021  3.50e+08
  HPCC_Init                 11      0.157    0.157    0.000     95799
* HPL_pdinfo               120      0.019    0.018    0.000     96996
* HPL_all_reduce             7      0.043    0.036    0.000       448
* HPL_broadcast             21      0.041    0.036    0.000       126
  HPL_pdlamch                2      0.004    0.004    0.000     94248
* HPL_fprintf              240      0.001    0.000    0.000      1200
  HPCC_InputFileInit        41      0.001    0.001    0.000       194
  ReadInts                   2      0.000    0.000    0.000        12
  PTRANS                    21     22.667   22.667    0.000  4.19e+07
  MaxMem                     5      0.000    0.000    0.000       796
* iceil_                   132      0.000    0.000    0.000       792
* ilcm_                     14      0.000    0.000    0.000        84
  param_dump                18      0.000    0.000    0.000        84
  Cblacs_get                 5      0.000    0.000    0.000        30
  Cblacs_gridmap            35      0.005    0.001    0.000       225
* Cblacs_pinfo               7      0.000    0.000    0.000        40
* Cblacs_gridinfo           60      0.000    0.000    0.000       260


MPI Auto-instrumentation

• To enable MPI auto-instrumentation, set this in macros.make:
    – ENABLE_PMPI=yes


MPI Auto-instrumentation (cont’d)

Stats for thread 0:
                          Called  Wallclock       max       min  AVG_MPI_BYTES
  MPI_Init_thru_Finalize       1   8.70e-04  8.70e-04  8.70e-04              -
  MPI_Send                     1   5.10e-05  5.10e-05  5.10e-05      4.096e+03
  MPI_Recv                     3   2.63e-04  2.32e-04  1.50e-05      4.096e+03
  MPI_Ssend                    1   2.40e-05  2.40e-05  2.40e-05      4.096e+03
  MPI_Issend                   1   1.00e-05  1.00e-05  1.00e-05      4.096e+03
  MPI_Sendrecv                 1   1.80e-05  1.80e-05  1.80e-05      8.192e+03
  MPI_Irecv                    2   1.00e-05  9.00e-06  1.00e-06      4.096e+03
  MPI_Isend                    2   6.00e-06  4.00e-06  2.00e-06      4.096e+03
  MPI_Wait                     2   1.80e-05  1.70e-05  1.00e-06              -
  MPI_Waitall                  2   1.10e-05  1.10e-05  0.00e+00              -
  MPI_Barrier                  1   2.20e-05  2.20e-05  2.20e-05              -
  MPI_Bcast                    1   9.00e-06  9.00e-06  9.00e-06      4.096e+03


Induced Overhead

• GPTL estimates its own overhead:

    overhead of 1 GPTLstart or GPTLstop call=1.28e-07 seconds

    Components are as follows:
    Fortran layer:              1.0e-09 =  1.5% of total
    Get thread number:          1.7e-08 = 13.3% of total
    Generate hash index:        1.9e-08 = 14.8% of total
    Find hashtable entry:       1.5e-08 = 11.7% of total
    Underlying timing routine:  7.0e-08 = 53.2% of total
    Misc start/stop functions:  7.0e-09 =  5.5% of total


Induced Overhead (cont’d)

Stats for thread 0:
              Called  Wallclock       max       min  self_OH  parent_OH
  total            1      0.910     0.910     0.910    0.000      0.000
  1x1e7            1      0.022     0.022     0.022    0.000      0.000
  10x1e6          10      0.015  1.55e-03  1.36e-03    0.000      0.000
  100x1e5        100      0.014  1.80e-04  1.11e-04    0.000      0.000
  1000x1e4      1000      0.015  2.01e-05  1.11e-05    0.000      0.000
  1e4x1000     10000      0.015  1.04e-05  1.12e-06    0.000      0.001
  1e5x100     100000      0.015  9.05e-06  1.22e-07    0.001      0.006
  1e6x10     1.0e+06      0.026  8.74e-06  1.67e-08    0.011      0.062
  1e7x1      1.0e+07      0.180  8.74e-06  1.11e-08    0.108      0.618


Underlying timing routine

• Default is gettimeofday()
• On Intel architectures, change to a register read, which has better granularity and much lower overhead:
    – C or Fortran: GPTLsetutr(GPTLnanotime);
    – Fortran: utr = 'nanotime' in namelist &gptlnl
    – May cause problems on machines with variable clock rate (e.g. “turbo mode”)


PAPI details handled by GPTL

• This call:

    GPTLsetoption (PAPI_FP_OPS, 1);

• Implies:

    PAPI_library_init (PAPI_VER_CURRENT);
    PAPI_thread_init ((unsigned long (*)(void)) pthread_self);
    PAPI_create_eventset (&EventSet[t]);
    PAPI_assign_eventset_component (EventSet[t], 0);
    PAPI_multiplex_init ();
    PAPI_set_multiplex (EventSet[t]);
    PAPI_add_event (EventSet[t], PAPI_FP_OPS);
    PAPI_start (EventSet[t]);

• PAPI multiplexing handled automatically, enabled only if needed


timing.summary file generated by GPTLpr_summary(comm)

name            ncalls nranks mean_time  std_dev  wallmax (rank)  wallmin (rank)
Diag              1002      2     4.371    3.453    6.812 (   0)    1.929 (   1)
MainLoop             2      2    53.364    0.007   53.369 (   0)   53.359 (   1)
ZeroTendencies     200      2     0.086    0.030    0.107 (   0)    0.065 (   1)
SaveFlux           200      2     0.149    0.048    0.183 (   0)    0.115 (   1)
RHStendencies      800      2     0.421    0.148    0.526 (   0)    0.317 (   1)
Vdtotal           1600      2    25.702    1.361   26.665 (   0)   24.740 (   1)
Vdm                800      2    23.851    1.118   24.642 (   0)   23.060 (   1)
vdmfinish          800      2     2.794    1.010    3.508 (   0)    2.080 (   1)
Vdn                800      2     1.848    0.246    2.022 (   0)    1.674 (   1)
Flux               800      2     4.818    1.135    5.620 (   1)    4.015 (   0)
Force              800      2     1.901    0.110    1.979 (   1)    1.823 (   0)
RKdiff             800      2     1.247    0.415    1.540 (   0)    0.953 (   1)
TimeDiff           800      2     0.736    0.182    0.865 (   0)    0.608 (   1)
Sponge             800      2     0.364    0.092    0.429 (   0)    0.299 (   1)
pre_trisol         200      2     0.112    0.027    0.131 (   0)    0.093 (   1)
Trisol             200      2     0.667    0.078    0.722 (   1)    0.612 (   0)
post_trisol        200      2     0.082    0.012    0.090 (   0)    0.073 (   1)
Vdmints            200      2     3.603    0.135    3.699 (   0)    3.508 (   1)
Pstadv             200      2     0.849    0.044    0.880 (   1)    0.817 (   0)
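As a worked check on how to read these columns, take the Diag row. With only two ranks, wallmax and wallmin are the two per-rank times. The calculation below assumes std_dev is the sample standard deviation (divisor n - 1); that is an inference from the printed numbers, not something the slides state.

```latex
\bar{t} = \frac{6.812 + 1.929}{2} = 4.3705, \qquad
s = \sqrt{\frac{(6.812 - 4.3705)^2 + (1.929 - 4.3705)^2}{2 - 1}}
  = \sqrt{2}\,(2.4415) \approx 3.453
```

Both values agree with the printed mean_time (4.371) and std_dev (3.453) to within rounding of the inputs. A std_dev that is large relative to mean_time, as here, points to load imbalance between ranks for that region.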


Utility functions

• To print current memory usage at any point in your code:
    – ret = GPTLprint_memusage ("user string");
• Produces e.g.
    – GPTLprint_memusage: user string size=19.5 MB rss=2.1 MB datastack=1.5 MB
• To auto-profile current memory usage (at both function entry and exit points):
    – ret = GPTLsetoption (GPTLdopr_memusage, 1);
• Retrieve wallclock, usr, sys timestamps to user code:
    – ret = GPTLstamp (&wallclock, &usr, &sys);


Future Work

• XML output
• Port to GPU
• Dynamic thread allocation for PTHREADS option
• Autoconf?


Source and Documentation

• Source: https://github.com/jmrosinski/GPTL
    – git clone [email protected]:jmrosinski/GPTL.git
• Web-based documentation:
    – jmrosinski.github.io/GPTL

• Feel free to email me: [email protected]
