using tau on sicortex alan morris, aroon nataraj sameer shende, allen d. malony university of oregon...

24
Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj , sameer, malony}@cs.uoregon.edu

Upload: shanon-lamb

Post on 16-Jan-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

Using TAU on SiCortexAlan Morris, Aroon Nataraj

Sameer Shende, Allen D. MalonyUniversity of Oregon

{amorris, anataraj, sameer, malony}@cs.uoregon.edu

Page 2: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 2

Outline

What is TAU? Instrumentation Measurement Invoking on SiCortex - tauex Demo

sweep3D Visualizing performance results

Page 3: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 3

TAU Performance System

Tuning and Analysis Utilities (13+ year project effort) Performance measurement framework for HPC systems

Portable, scalable, flexible, and parallel Integrated toolkit for performance problem solving

Automatic instrumentation Highly configurable measurement system with support

for many flavors of profiling and tracing Portable analysis and visualization tools Performance data management and data mining

http://tau.uoregon.edu

Page 4: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 4

TAU Instrumentation Approach

Support for standard program events Routines Classes and templates Finer-grain -- loop-level

Support for user-defined events Begin/End events (“user-defined timers”) Atomic events (e.g., size of memory allocated/freed)

Support for event groups Selective examination of performance data Runtime disabling of groups

Instrumentation optimization Selective instrumentation of events (only instrument needed) tau_reduce - generate selective instrumentation file

Page 5: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 5

TAU Instrumentation

Flexible instrumentation mechanisms at multiple levels Source code

manual (TAU API, CCA TAU Component API) automatic

C, C++, F77/90/95 (Program Database Toolkit (PDT))OpenMP (directive rewriting (Opari), POMP spec)

Library level pre-instrumented libraries (e.g., MPI using PMPI) statically-linked and dynamically-linked

Executable code dynamic instrumentation (pre-execution) (DynInstAPI) virtual machine instrumentation (e.g., Java using JVMPI)

Runtime Linking (LD_PRELOAD)

Page 6: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 6

Automatic Instrumentation

We provide compiler wrapper scripts Simply replace mpif90 with tauf90 Automatically instruments Fortran source code, links

with TAU MPI Wrapper libraries. Use taucc and taucxx for C/C++BeforeCXX = mpicxx

F90 = mpif90

CFLAGS =

LIBS = -lm

OBJS = f1.o f2.o f3.o … fn.o

app: $(OBJS)

$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:

$(CC) $(CFLAGS) -c $<

AfterCXX = taucxx

F90 = tauf90

CFLAGS =

LIBS = -lm

OBJS = f1.o f2.o f3.o … fn.o

app: $(OBJS)

$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:

$(CC) $(CFLAGS) -c $<

Page 7: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 7

Measurement Options

Flat profiles Time (or counts) spent in each routine (nodes in callgraph). Exclusive/inclusive time, no. of calls, child calls Support for hardware counters (PAPI), multiple counters.

Callpath Profiles Flat profiles, plus Time spent along a calling path (edges in callgraph) E.g., “main=> f1 => f2 => MPI_Send” shows the time spent in

MPI_Send when called by f2, when f2 is called by f1, when it is called by main.

Configurable callpath depth limit (TAU_CALLPATH_DEPTH environment variable)

Tracing VAMPIRTRACE TAU Trace format; Converters: tau2slog2, tau2vtf, tau2otf

Page 8: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 8

Running Applications with TAU on SiCortex

New tool, tauex

Usage: tauex [options] [--] <exe> <exe options>

Options:

-d: Enable debugging output, use repeatedly for more output.

-h: Print this message.

-i: Print information about the host machine.

-s: Dump the shell environment variables and exit.

-U: User mode counts

-K: Kernel mode counts

-S: Supervisor mode counts

-I: Interrupt mode counts

-l: List events

-L <event> : Describe event

-a: Count all native events (implies -m)

-m: Multiple runs (enough runs of exe to gather all events)

-e <event> : Specify PAPI preset or native event

-T <OPENMP,PROFILE,CALLPATH,TRACE,VAMPIRTRACE,EPILOG,DISABLE> : specify TAU option

-v: Debug/Verbose mode

-XrunTAU-<options> : specify TAU library directly

Page 9: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 9

Demo

Application Sweep3D Build standard sweep3d application (un-instrumented) Run standard (un-instrumented) sweep3d using tauex

Provides MPI-only profiles Build sweep3d with automatic TAU instrumentor Run instrumented sweep3d with tauex

Provides application-level and MPI events

Page 10: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 10

Using tauex

Use with uninstrumented executable for MPI profiling and tracing

$ mpif90 ring.f90 –o ring

$ srun -pscx -n4 tauex ./ring

$ cd ring.tau.1669/MULTI__P_WALL_CLOCK_TIME

$ pprof

…FUNCTION SUMMARY (mean):

---------------------------------------------------------------------------------------

%Time Exclusive Inclusive #Call #Subrs Inclusive Name

msec total msec usec/call

---------------------------------------------------------------------------------------

100.0 3 338 1 8 338086 .TAU application

59.5 201 201 1 0 201180 MPI_Init()

37.2 125 125 1 0 125889 MPI_Finalize()

2.1 6 6 1 0 6971 MPI_Barrier()

0.2 0.626 0.626 1 0 626 MPI_Recv()

0.1 0.247 0.247 1 0 247 MPI_Bcast()

0.0 0.102 0.102 1 0 102 MPI_Send()

0.0 0.00875 0.00875 1 0 9 MPI_Comm_size()

0.0 0.00525 0.00525 1 0 5 MPI_Comm_rank()

Page 11: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 11

Using TAU compiler wrappers

$ tauf90 ring.f90 –o ring

# verbose output shows each step

Debug: Parsing with PDT Parser

Executing> /usr/share/PDT/mips/bin/f95parse ring.f90 -I/home/amorris/usr/include

Debug: Instrumenting with TAU

Executing> /usr/bin/tau_instrumentor ring.pdb ring.f90 -o ring.inst.f90

Debug: Compiling (Individually) with Instrumented Code

Executing> pathf95 -mabi=64 -I. -c ring.inst.f90 -o ring.o

Debug: Linking (Together) object files

Executing> pathf95 ring.o -mabi=64 -lTAU -Wl,-rpath -lpfm -lpapi -lpfm -lpthread -L/usr/lib/gcc/mips64el-gentoo-linux-gnu/4.1.2/ -lstdc++ -lgcc_s -lscmpi -o ring

Debug: cleaning inst file

Executing> /bin/rm -f ring.inst.f90

Debug: cleaning PDB file

Executing> /bin/rm -f ring.pdb

Page 12: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 12

Using tauex

Use with instrumented executables for Application+MPI profiling and tracing

$ tau_f90.sh ring.f90 –o ring

$ srun -pscx -n4 tauex –e CPU_CYCLES ./ring

$ cd ring.tau.1674/MULTI__CPU_CYCLES

$ pprofFUNCTION SUMMARY (mean):

---------------------------------------------------------------------------------------

%Time Exclusive Inclusive #Call #Subrs Count/Call Name

counts total counts

---------------------------------------------------------------------------------------

100.0 4.552E+05 2.557E+07 1 5 25568675 MAIN

81.1 2.074E+07 2.074E+07 1 0 20742874 MPI_Init()

16.3 1.419E+05 4.176E+06 1 4 4176389 FUNC

14.2 3.639E+06 3.639E+06 1 0 3638847 MPI_Barrier()

0.8 2.091E+05 2.091E+05 1 0 209063 MPI_Recv()

0.7 1.875E+05 1.875E+05 1 0 187542 MPI_Finalize()

0.6 1.539E+05 1.539E+05 1 0 153880 MPI_Bcast()

0.1 3.27E+04 3.27E+04 1 0 32702 MPI_Send()

0.0 4337 4337 1 0 4337 MPI_Comm_size()

0.0 2308 2308 1 0 2308 MPI_Comm_rank()

Page 13: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 13

Using tauex

Other typical usage scenarios

# floating point instruction counts and time (compute derived # FLOPS in ParaProf)

$ tauex –e P_WALL_CLOCK_TIME –e PAPI_FP_INS <app>

# Generate callpath profiles

$ tauex –e PAPI_FP_INS –T callpath <app>

# Generate OTF traces for Vampir

$ tauex –e PAPI_FP_INS –T vampirtrace <app>

# Generate Epilog traces for Kojak

$ tauex –T epilog <app>

Page 14: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 14

Using TAU with FLASH on SiCortex

To use TAU with FLASH on SiCortex platforms, simply specify –tau=<path/to/stub/makefile> in setup

# On Full Disclosure:

$ ./setup Sedov -2d -auto -site=sicortex -objdir=tau -tau=/home/amorris/usr/share/TAU/64/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt

# Build

$ cd tau ; make

# Run (using tauex)

# This will generate callpath profiles with time and floating point instruction metrics

$ srun -pscx -n16 tauex -T callpath –e P_WALL_CLOCK_TIME –e PAPI_FP_INS ./flash3

Page 15: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 15

Using ParaProf

Not yet available on SiCortex mips nodes (requires Java) For now, tar up <app>.tau.<jobid> directory and copy to

another machine ParaProf can be run on Linux, Windows, or Mac

If you don’t have TAU/paraprof, but have Java Web Start, visit http://tau.uoregon.edu/paraprof to run it

scx-m23-n6> tar czf flash3.tau.1578.tar.gz flash3.tau.1578

scx-m32-n6> scp flash3.tau.1578.tar.gz somewhere-else:

scx-m32-n6> ssh somewhere-else

se> tar –xzf flash3.tau.1578.tar.gz

se> paraprof flash3.tau.1578

Page 16: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 16

ParaProf – Full Profile (FLASH)

MPI_Barrier MPI_Barrier IO routines IO routines

Page 17: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 17

ParaProf - Statistics Table (Uintah)

Page 18: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 18

ParaProf –Callgraph View (MFIX)

Page 19: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 19

ParaProf – Histogram View (Miranda)

8k processors 16k processors

Scalable 2D displays

Page 20: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 20

ParaProf – 3D Full Profile (Miranda)

16k processors

Page 21: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 21

ParaProf – 3D Scatterplot (Miranda)

Each pointis a “thread”of execution

Relation

between four

routines

shown at

once

Page 22: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 22

Tracing (Vampir) - Uintah

Trace analysis provides in-depth understanding of temporal event and message passing relationships

Traces can even store hardware counters

Page 23: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 23

VNG Timeline Display (Miranda on BGL)

Page 24: Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu

TAU Performance System 24

Thank You

TAU should soon be part of the SiCortex standard install Check out: http://tau.uoregon.edu