using tau on sicortex alan morris, aroon nataraj sameer shende, allen d. malony university of oregon...
TRANSCRIPT
Using TAU on SiCortexAlan Morris, Aroon Nataraj
Sameer Shende, Allen D. MalonyUniversity of Oregon
{amorris, anataraj, sameer, malony}@cs.uoregon.edu
TAU Performance System 2
Outline
What is TAU? Instrumentation Measurement Invoking on SiCortex - tauex Demo
sweep3D Visualizing performance results
TAU Performance System 3
TAU Performance System
Tuning and Analysis Utilities (13+ year project effort) Performance measurement framework for HPC systems
Portable, scalable, flexible, and parallel Integrated toolkit for performance problem solving
Automatic instrumentation Highly configurable measurement system with support
for many flavors of profiling and tracing Portable analysis and visualization tools Performance data management and data mining
http://tau.uoregon.edu
TAU Performance System 4
TAU Instrumentation Approach
Support for standard program events Routines Classes and templates Finer-grain -- loop-level
Support for user-defined events Begin/End events (“user-defined timers”) Atomic events (e.g., size of memory allocated/freed)
Support for event groups Selective examination of performance data Runtime disabling of groups
Instrumentation optimization Selective instrumentation of events (only instrument needed) tau_reduce - generate selective instrumentation file
TAU Performance System 5
TAU Instrumentation
Flexible instrumentation mechanisms at multiple levels Source code
manual (TAU API, CCA TAU Component API) automatic
C, C++, F77/90/95 (Program Database Toolkit (PDT))OpenMP (directive rewriting (Opari), POMP spec)
Library level pre-instrumented libraries (e.g., MPI using PMPI) statically-linked and dynamically-linked
Executable code dynamic instrumentation (pre-execution) (DynInstAPI) virtual machine instrumentation (e.g., Java using JVMPI)
Runtime Linking (LD_PRELOAD)
TAU Performance System 6
Automatic Instrumentation
We provide compiler wrapper scripts Simply replace mpif90 with tauf90 Automatically instruments Fortran source code, links
with TAU MPI Wrapper libraries. Use taucc and taucxx for C/C++BeforeCXX = mpicxx
F90 = mpif90
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o … fn.o
app: $(OBJS)
$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
$(CC) $(CFLAGS) -c $<
AfterCXX = taucxx
F90 = tauf90
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o … fn.o
app: $(OBJS)
$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
$(CC) $(CFLAGS) -c $<
TAU Performance System 7
Measurement Options
Flat profiles Time (or counts) spent in each routine (nodes in callgraph). Exclusive/inclusive time, no. of calls, child calls Support for hardware counters (PAPI), multiple counters.
Callpath Profiles Flat profiles, plus Time spent along a calling path (edges in callgraph) E.g., “main=> f1 => f2 => MPI_Send” shows the time spent in
MPI_Send when called by f2, when f2 is called by f1, when it is called by main.
Configurable callpath depth limit (TAU_CALLPATH_DEPTH environment variable)
Tracing VAMPIRTRACE TAU Trace format; Converters: tau2slog2, tau2vtf, tau2otf
TAU Performance System 8
Running Applications with TAU on SiCortex
New tool, tauex
Usage: tauex [options] [--] <exe> <exe options>
Options:
-d: Enable debugging output, use repeatedly for more output.
-h: Print this message.
-i: Print information about the host machine.
-s: Dump the shell environment variables and exit.
-U: User mode counts
-K: Kernel mode counts
-S: Supervisor mode counts
-I: Interrupt mode counts
-l: List events
-L <event> : Describe event
-a: Count all native events (implies -m)
-m: Multiple runs (enough runs of exe to gather all events)
-e <event> : Specify PAPI preset or native event
-T <OPENMP,PROFILE,CALLPATH,TRACE,VAMPIRTRACE,EPILOG,DISABLE> : specify TAU option
-v: Debug/Verbose mode
-XrunTAU-<options> : specify TAU library directly
TAU Performance System 9
Demo
Application Sweep3D Build standard sweep3d application (un-instrumented) Run standard (un-instrumented) sweep3d using tauex
Provides MPI-only profiles Build sweep3d with automatic TAU instrumentor Run instrumented sweep3d with tauex
Provides application-level and MPI events
TAU Performance System 10
Using tauex
Use with uninstrumented executable for MPI profiling and tracing
$ mpif90 ring.f90 –o ring
$ srun -pscx -n4 tauex ./ring
$ cd ring.tau.1669/MULTI__P_WALL_CLOCK_TIME
$ pprof
…FUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time Exclusive Inclusive #Call #Subrs Inclusive Name
msec total msec usec/call
---------------------------------------------------------------------------------------
100.0 3 338 1 8 338086 .TAU application
59.5 201 201 1 0 201180 MPI_Init()
37.2 125 125 1 0 125889 MPI_Finalize()
2.1 6 6 1 0 6971 MPI_Barrier()
0.2 0.626 0.626 1 0 626 MPI_Recv()
0.1 0.247 0.247 1 0 247 MPI_Bcast()
0.0 0.102 0.102 1 0 102 MPI_Send()
0.0 0.00875 0.00875 1 0 9 MPI_Comm_size()
0.0 0.00525 0.00525 1 0 5 MPI_Comm_rank()
TAU Performance System 11
Using TAU compiler wrappers
$ tauf90 ring.f90 –o ring
# verbose output shows each step
Debug: Parsing with PDT Parser
Executing> /usr/share/PDT/mips/bin/f95parse ring.f90 -I/home/amorris/usr/include
Debug: Instrumenting with TAU
Executing> /usr/bin/tau_instrumentor ring.pdb ring.f90 -o ring.inst.f90
Debug: Compiling (Individually) with Instrumented Code
Executing> pathf95 -mabi=64 -I. -c ring.inst.f90 -o ring.o
Debug: Linking (Together) object files
Executing> pathf95 ring.o -mabi=64 -lTAU -Wl,-rpath -lpfm -lpapi -lpfm -lpthread -L/usr/lib/gcc/mips64el-gentoo-linux-gnu/4.1.2/ -lstdc++ -lgcc_s -lscmpi -o ring
Debug: cleaning inst file
Executing> /bin/rm -f ring.inst.f90
Debug: cleaning PDB file
Executing> /bin/rm -f ring.pdb
TAU Performance System 12
Using tauex
Use with instrumented executables for Application+MPI profiling and tracing
$ tau_f90.sh ring.f90 –o ring
$ srun -pscx -n4 tauex –e CPU_CYCLES ./ring
$ cd ring.tau.1674/MULTI__CPU_CYCLES
$ pprofFUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time Exclusive Inclusive #Call #Subrs Count/Call Name
counts total counts
---------------------------------------------------------------------------------------
100.0 4.552E+05 2.557E+07 1 5 25568675 MAIN
81.1 2.074E+07 2.074E+07 1 0 20742874 MPI_Init()
16.3 1.419E+05 4.176E+06 1 4 4176389 FUNC
14.2 3.639E+06 3.639E+06 1 0 3638847 MPI_Barrier()
0.8 2.091E+05 2.091E+05 1 0 209063 MPI_Recv()
0.7 1.875E+05 1.875E+05 1 0 187542 MPI_Finalize()
0.6 1.539E+05 1.539E+05 1 0 153880 MPI_Bcast()
0.1 3.27E+04 3.27E+04 1 0 32702 MPI_Send()
0.0 4337 4337 1 0 4337 MPI_Comm_size()
0.0 2308 2308 1 0 2308 MPI_Comm_rank()
TAU Performance System 13
Using tauex
Other typical usage scenarios
# floating point instruction counts and time (compute derived # FLOPS in ParaProf)
$ tauex –e P_WALL_CLOCK_TIME –e PAPI_FP_INS <app>
# Generate callpath profiles
$ tauex –e PAPI_FP_INS –T callpath <app>
# Generate OTF traces for Vampir
$ tauex –e PAPI_FP_INS –T vampirtrace <app>
# Generate Epilog traces for Kojak
$ tauex –T epilog <app>
TAU Performance System 14
Using TAU with FLASH on SiCortex
To use TAU with FLASH on SiCortex platforms, simply specify –tau=<path/to/stub/makefile> in setup
# On Full Disclosure:
$ ./setup Sedov -2d -auto -site=sicortex -objdir=tau -tau=/home/amorris/usr/share/TAU/64/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt
# Build
$ cd tau ; make
# Run (using tauex)
# This will generate callpath profiles with time and floating point instruction metrics
$ srun -pscx -n16 tauex -T callpath –e P_WALL_CLOCK_TIME –e PAPI_FP_INS ./flash3
TAU Performance System 15
Using ParaProf
Not yet available on SiCortex mips nodes (requires Java) For now, tar up <app>.tau.<jobid> directory and copy to
another machine ParaProf can be run on Linux, Windows, or Mac
If you don’t have TAU/paraprof, but have Java Web Start, visit http://tau.uoregon.edu/paraprof to run it
scx-m23-n6> tar czf flash3.tau.1578.tar.gz flash3.tau.1578
scx-m32-n6> scp flash3.tau.1578.tar.gz somewhere-else:
scx-m32-n6> ssh somewhere-else
se> tar –xzf flash3.tau.1578.tar.gz
se> paraprof flash3.tau.1578
TAU Performance System 16
ParaProf – Full Profile (FLASH)
MPI_Barrier MPI_Barrier IO routines IO routines
TAU Performance System 17
ParaProf - Statistics Table (Uintah)
TAU Performance System 18
ParaProf –Callgraph View (MFIX)
TAU Performance System 19
ParaProf – Histogram View (Miranda)
8k processors 16k processors
Scalable 2D displays
TAU Performance System 20
ParaProf – 3D Full Profile (Miranda)
16k processors
TAU Performance System 21
ParaProf – 3D Scatterplot (Miranda)
Each pointis a “thread”of execution
Relation
between four
routines
shown at
once
TAU Performance System 22
Tracing (Vampir) - Uintah
Trace analysis provides in-depth understanding of temporal event and message passing relationships
Traces can even store hardware counters
TAU Performance System 23
VNG Timeline Display (Miranda on BGL)
TAU Performance System 24
Thank You
TAU should soon be part of the SiCortex standard install Check out: http://tau.uoregon.edu