TAU Performance System Sameer Shende , Allen D. Malony University of Oregon {sameer, malony}@cs.uoregon.edu Workshop Jan 9-10, 2006. Classroom 1, T 1889, LLNL

Post on 20-Dec-2015


TAU Performance System

Sameer Shende, Allen D. Malony

University of Oregon

{sameer, malony}@cs.uoregon.edu

Workshop Jan 9-10, 2006. Classroom 1, T 1889, LLNL


Acknowledgements

• Alan Morris [UO]
• Holger Brunst and Wolfgang Nagel [TU Dresden]
• Bernd Mohr [Research Center Juelich, Germany]
• Wyatt Spear [UO]


Research Motivation

• Tools for performance problem solving
• Empirical-based performance optimization process
• Performance technology concerns

[Figure: performance optimization cycle with the stages Performance Tuning, Performance Diagnosis, Performance Experimentation, and Performance Observation; the transitions are labeled hypotheses, properties, and characterization. Supporting performance technology: instrumentation, measurement, analysis, visualization, experiment management, performance storage.]


Outline of Talk

• Overview of TAU
• Instrumentation
• Measurement
• Analysis: ParaProf and Vampir/VNG
• Performance data management and data mining
    • Performance Data Management Framework (PerfDMF)
    • PerfExplorer
        • Multi-experiment case studies
        • Clustering analysis
• Future work and concluding remarks


TAU Performance System

• Tuning and Analysis Utilities (13+ year project effort)
• Performance system framework for HPC systems
    • Integrated, scalable, flexible, and parallel
• Targets a general complex system computation model
    • Entities: nodes / contexts / threads
    • Multi-level: system / software / parallelism
    • Measurement and analysis abstraction
• Integrated toolkit for performance problem solving
    • Instrumentation, measurement, analysis, and visualization
    • Portable performance profiling and tracing facility
    • Performance data management and data mining

http://www.cs.uoregon.edu/research/tau


Definitions – Profiling

• Profiling
    • Recording of summary information during execution
        • inclusive, exclusive time, # calls, hardware statistics, …
    • Reflects performance behavior of program entities
        • functions, loops, basic blocks
        • user-defined “semantic” entities
    • Very good for low-cost performance assessment
    • Helps to expose performance bottlenecks and hotspots
    • Implemented through
        • sampling: periodic OS interrupts or hardware counter traps
        • instrumentation: direct insertion of measurement code
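To make the difference between inclusive and exclusive time concrete, here is a minimal sketch of instrumentation-based profiling. It is not TAU code: the `Profiler` class, its `enter`/`exit` methods, and the explicit timestamps are assumptions chosen so the example is deterministic. A real measurement library would read a clock or hardware counter instead.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Summary data kept per routine, as a profile would record it.
struct ProfileEntry {
    long calls = 0;
    long inclusive = 0;  // time in the routine including its callees
    long exclusive = 0;  // time in the routine itself
};

// Hypothetical profiler driven by explicit timestamps (illustration only).
class Profiler {
public:
    void enter(const std::string& name, long now) {
        stack_.push_back({name, now, 0});
    }
    void exit(long now) {
        assert(!stack_.empty());
        Frame f = stack_.back();
        stack_.pop_back();
        long inclusive = now - f.start;
        ProfileEntry& e = table_[f.name];
        e.calls += 1;
        e.inclusive += inclusive;
        e.exclusive += inclusive - f.child_time;
        // Charge this routine's inclusive time to its caller as child time.
        if (!stack_.empty()) stack_.back().child_time += inclusive;
    }
    const ProfileEntry& entry(const std::string& name) const {
        return table_.at(name);
    }
private:
    struct Frame { std::string name; long start; long child_time; };
    std::vector<Frame> stack_;
    std::map<std::string, ProfileEntry> table_;
};

// Example run: main spans t=0..100 and calls foo twice (20 and 10 time units).
inline Profiler profile_example() {
    Profiler p;
    p.enter("main", 0);
    p.enter("foo", 10); p.exit(30);
    p.enter("foo", 50); p.exit(60);
    p.exit(100);
    return p;
}
```

In this run `foo` accumulates two calls and 30 units of inclusive time, while `main` has 100 units inclusive and 70 exclusive. Only these per-routine sums are kept, which is why profiling is cheap compared with storing every event.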


Definitions – Tracing

• Tracing
    • Recording of information about significant points (events) during program execution
        • entering/exiting code region (function, loop, block, …)
        • thread/process interactions (e.g., send/receive message)
    • Save information in event record
        • timestamp
        • CPU identifier, thread identifier
        • Event type and event-specific information
    • Event trace is a time-sequenced stream of event records
    • Can be used to reconstruct dynamic program behavior
    • Typically requires code instrumentation


Event Tracing: Instrumentation, Monitor, Trace

Event definitions: 1 = master, 2 = slave, 3 = ...

Instrumented code on CPU A:

    void master() {
      trace(ENTER, 1);
      ...
      trace(SEND, B);
      send(B, tag, buf);
      ...
      trace(EXIT, 1);
    }

Instrumented code on CPU B:

    void slave() {
      trace(ENTER, 2);
      ...
      recv(A, tag, buf);
      trace(RECV, A);
      ...
      trace(EXIT, 2);
    }

Merged trace collected by the MONITOR (timestamp, CPU, event type, event data):

    58 A ENTER 1
    60 B ENTER 2
    62 A SEND B
    64 A EXIT 1
    68 B RECV A
    ...
    69 B EXIT 2
    ...
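The trace records listed above can be replayed to recover program behavior. The sketch below is illustrative only: the `EventRecord` layout and the helper names are assumptions, not a TAU or Vampir trace format. It stores the six records shown on the slide and derives two quantities from them.

```cpp
#include <string>
#include <vector>

enum class EventType { Enter, Exit, Send, Recv };

// One event record: timestamp, CPU identifier, event type, event-specific data.
struct EventRecord {
    int timestamp;
    char cpu;
    EventType type;
    std::string data;  // region id for Enter/Exit, peer CPU for Send/Recv
};

// The merged trace from the slide; the elided "..." records are omitted.
inline std::vector<EventRecord> sample_trace() {
    return {
        {58, 'A', EventType::Enter, "1"},
        {60, 'B', EventType::Enter, "2"},
        {62, 'A', EventType::Send,  "B"},
        {64, 'A', EventType::Exit,  "1"},
        {68, 'B', EventType::Recv,  "A"},
        {69, 'B', EventType::Exit,  "2"},
    };
}

// Time between Enter and Exit of a region on one CPU, or -1 if not found.
inline int region_duration(const std::vector<EventRecord>& trace,
                           char cpu, const std::string& region) {
    int enter = -1;
    for (const EventRecord& e : trace) {
        if (e.cpu != cpu || e.data != region) continue;
        if (e.type == EventType::Enter) enter = e.timestamp;
        if (e.type == EventType::Exit && enter >= 0) return e.timestamp - enter;
    }
    return -1;
}

// Delay between a Send on `from` addressed to `to` and the next matching Recv.
inline int message_delay(const std::vector<EventRecord>& trace, char from, char to) {
    int sent = -1;
    for (const EventRecord& e : trace) {
        if (e.type == EventType::Send && e.cpu == from && e.data == std::string(1, to))
            sent = e.timestamp;
        if (e.type == EventType::Recv && e.cpu == to &&
            e.data == std::string(1, from) && sent >= 0)
            return e.timestamp - sent;
    }
    return -1;
}
```

With the sample data, region 1 (master) on CPU A lasts 6 time units, region 2 (slave) on CPU B lasts 9, and the message from A to B is in flight for 6. A profile could not answer the last question, because it keeps no ordering between events on different CPUs.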


Event Tracing: “Timeline” Visualization

[Figure: timeline visualization of the same trace records (58 A ENTER 1 through 69 B EXIT 2), with event definitions 1 = master, 2 = slave, 3 = ... Time axis: 58 to 70; rows: CPU A and CPU B; legend: main, master, slave.]


TAU Parallel Performance System Goals

• Multi-level performance instrumentation
    • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
    • Computer system architectures and operating systems
    • Different programming languages and compilers
• Support for multiple parallel programming paradigms
    • Multi-threading, message passing, mixed-mode, hybrid
• Support for performance mapping
• Support for object-oriented and generic programming
• Integration in complex software, systems, applications


TAU Performance System Architecture

[Figure: TAU architecture diagram; the only surviving label is “event selection”.]


TAU Performance System Architecture


Program Database Toolkit (PDT)

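[Figure: PDT data flow. Application / library source is read by the C / C++ parser or the Fortran parser (F77/90/95), which emit an intermediate language (IL). The C / C++ IL analyzer and the Fortran IL analyzer turn the IL into Program Database Files, which tools access through DUCTAPE.]

Tools built on DUCTAPE:

• PDBhtml: program documentation
• SILOON: application component glue
• CHASM: C++ / F90/95 interoperability
• TAU_instr: automatic source instrumentation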


TAU Instrumentation Approach

• Support for standard program events
    • Routines
    • Classes and templates
    • Statement-level blocks
• Support for user-defined events
    • Begin/End events (“user-defined timers”)
    • Atomic events (e.g., size of memory allocated/freed)
    • Selection of event statistics
• Support definition of “semantic” entities for mapping
• Support for event groups
• Instrumentation optimization (eliminate instrumentation in lightweight routines)


TAU Instrumentation

• Flexible instrumentation mechanisms at multiple levels
    • Source code
        • manual (TAU API, TAU Component API)
        • automatic
            • C, C++, F77/90/95 (Program Database Toolkit (PDT))
            • OpenMP (directive rewriting (Opari), POMP spec)
    • Object code
        • pre-instrumented libraries (e.g., MPI using PMPI)
        • statically-linked and dynamically-linked
    • Executable code
        • dynamic instrumentation (pre-execution) (DynInstAPI)
        • virtual machine instrumentation (e.g., Java using JVMPI)
    • Proxy Components


Using TAU – A tutorial

• Configuration
• Instrumentation
    • Manual
    • MPI – Wrapper interposition library
    • PDT – Source rewriting for C, C++, F77/90/95
    • OpenMP – Directive rewriting
    • Component based instrumentation – Proxy components
    • Binary Instrumentation
        • DyninstAPI – Runtime instrumentation / rewriting binary
        • Java – Runtime instrumentation
        • Python – Runtime instrumentation
• Measurement
• Performance Analysis


TAU Measurement System Configuration

configure [OPTIONS]

    {-c++=<CC>, -cc=<cc>}     Specify C++ and C compilers
    {-pthread, -sproc}        Use pthread or SGI sproc threads
    -openmp                   Use OpenMP threads
    -jdk=<dir>                Specify Java instrumentation (JDK)
    -opari=<dir>              Specify location of Opari OpenMP tool
    -papi=<dir>               Specify location of PAPI
    -pdt=<dir>                Specify location of PDT
    -dyninst=<dir>            Specify location of DynInst Package
    -mpi[inc/lib]=<dir>       Specify MPI library instrumentation
    -shmem[inc/lib]=<dir>     Specify PSHMEM library instrumentation
    -python[inc/lib]=<dir>    Specify Python instrumentation
    -epilog=<dir>             Specify location of EPILOG
    -slog2[=<dir>]            Specify location of SLOG2/Jumpshot
    -vtf=<dir>                Specify location of VTF3 trace package
    -arch=<architecture>      Specify architecture explicitly (bgl, ibm64, ibm64linux, …)


TAU Measurement System Configuration

configure [OPTIONS]

    -TRACE               Generate binary TAU traces
    -PROFILE (default)   Generate profiles (summary)
    -PROFILECALLPATH     Generate call path profiles
    -PROFILEPHASE        Generate phase based profiles
    -PROFILEMEMORY       Track heap memory for each routine
    -PROFILEHEADROOM     Track memory headroom to grow
    -MULTIPLECOUNTERS    Use hardware counters + time
    -COMPENSATE          Compensate timer overhead
    -CPUTIME             Use usertime+system time
    -PAPIWALLCLOCK       Use PAPI’s wallclock time
    -PAPIVIRTUAL         Use PAPI’s process virtual time
    -SGITIMERS           Use fast IRIX timers
    -LINUXTIMERS         Use fast x86 Linux timers


TAU Measurement Configuration – Examples

• ./configure -c++=xlC_r -pthread
    • Use TAU with xlC_r and pthread library under AIX
    • Enable TAU profiling (default)
• ./configure -TRACE -PROFILE
    • Enable both TAU profiling and tracing
• ./configure -c++=xlC_r -cc=xlc_r -papi=/usr/local/packages/papi -pdt=/usr/local/pdtoolkit-3.4 -arch=ibm64 -mpiinc=/usr/lpp/ppe.poe/include -mpilib=/usr/lpp/ppe.poe/lib -MULTIPLECOUNTERS
    • Use IBM’s xlC_r and xlc_r compilers with PAPI, PDT, MPI packages and multiple counters for measurements
• Typically configure multiple measurement libraries
• Each configuration creates a unique <arch>/lib/Makefile.tau-<options> stub makefile that corresponds to the configuration options specified, e.g.,
    • /usr/local/tau/tau-2.14.8/x86_64/lib/Makefile.tau-icpc-mpi-pdt
    • /usr/local/tau/tau-2.14.8/x86_64/lib/Makefile.tau-icpc-mpi-pdt-trace


Using TAU on IBM BG/L

• Configure PDT:
    % configure -XLC -exec-prefix=bgl ; make clean install
    • Use XLC compiler
• Configure TAU for front-end:
    % configure ; make clean install
    • Add <taudir>/ppc64/bin/ to your path
• Configure TAU for back-end:
    % configure -arch=bgl -mpi -pdt=<dir> -pdt_c++=xlC -c++=blrts_xlC -cc=blrts_xlc -fortran=ibm
    • Use IBM’s Blue Gene/L blrts_xlC compilers for building the library and xlC for building tau_instrumentor [-pdt_c++=xlC]. It executes on the front-end.
    • Libraries are built in <taudir>/bgl/lib/ directory


TAU_SETUP: A GUI for Installing TAU

tau-2.x>./tau_setup


Configuration Parameters in Stub Makefiles

• Each TAU stub Makefile resides in the <tau>/<arch>/lib directory
• Variables:

    TAU_CXX               Specify the C++ compiler used by TAU
    TAU_CC, TAU_F90       Specify the C, F90 compilers
    TAU_DEFS              Defines used by TAU. Add to CFLAGS
    TAU_LDFLAGS           Linker options. Add to LDFLAGS
    TAU_INCLUDE           Header files include path. Add to CFLAGS
    TAU_LIBS              Statically linked TAU library. Add to LIBS
    TAU_SHLIBS            Dynamically linked TAU library
    TAU_MPI_LIBS          TAU’s MPI wrapper library for C/C++
    TAU_MPI_FLIBS         TAU’s MPI wrapper library for F90
    TAU_FORTRANLIBS       Must be linked in with C++ linker for F90
    TAU_CXXLIBS           Must be linked in with F90 linker
    TAU_INCLUDE_MEMORY    Use TAU’s malloc/free wrapper lib
    TAU_DISABLE           TAU’s dummy F90 stub library
    TAU_COMPILER          Instrument using tau_compiler.sh script

• Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs (TAU_DISABLE for f90).


Using TAU

• Install TAU
    % configure ; make clean install
• Instrument application
    • TAU Profiling API
• Typically modify application makefile
    • include TAU’s stub makefile, modify variables
• Set environment variables
    • directory where profiles/traces are to be stored
    • name of merged trace file, retain intermediate trace files, etc.
• Execute application
    % mpirun -np <procs> a.out;
• Analyze performance data
    • paraprof, vampir, pprof, paraver …


Using TAU – A tutorial

• Configuration
• Instrumentation
    • Manual
    • MPI – Wrapper interposition library
    • PDT – Source rewriting for C, C++, F77/90/95
    • OpenMP – Directive rewriting
    • Component based instrumentation – Proxy components
    • Binary Instrumentation
        • DyninstAPI – Runtime instrumentation / rewriting binary
        • Java – Runtime instrumentation
        • Python – Runtime instrumentation
• Measurement
• Performance Analysis


TAU Manual Instrumentation API for C/C++

• Initialization and runtime configuration
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(myNode);
    TAU_PROFILE_SET_CONTEXT(myContext);
    TAU_PROFILE_EXIT(message);
    TAU_REGISTER_THREAD();
• Function and class methods for C++ only:
    TAU_PROFILE(name, type, group);
    TAU_PHASE(name, type, group);
• Template
    TAU_TYPE_STRING(variable, type);
    TAU_PROFILE(name, type, group);
    CT(variable);
• User-defined timing
    TAU_PROFILE_TIMER(timer, name, type, group);
    TAU_PROFILE_START(timer);
    TAU_PROFILE_STOP(timer);


TAU Measurement API (continued)

• Defining application phases
    TAU_PHASE_CREATE_STATIC(var, name, type, group);
    TAU_PHASE_CREATE_DYNAMIC(var, name, type, group);
    TAU_PHASE_START(var)
    TAU_PHASE_STOP(var)
• User-defined events
    TAU_REGISTER_EVENT(variable, event_name);
    TAU_EVENT(variable, value);
    TAU_PROFILE_STMT(statement);
• Heap Memory Tracking:
    TAU_TRACK_MEMORY();
    TAU_TRACK_MEMORY_HEADROOM();
    TAU_SET_INTERRUPT_INTERVAL(seconds);
    TAU_DISABLE_TRACKING_MEMORY[_HEADROOM]();
    TAU_ENABLE_TRACKING_MEMORY[_HEADROOM]();


Manual Instrumentation – C++ Example

#include <TAU.h>

const int N = 100;   /* loop bound; value assumed for illustration */
void work(int i);    /* application routine, assumed to be defined elsewhere */
int foo(void);

int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  foo();
  return 0;
}

int foo(void)
{
  TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
  TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  // other statements in foo …
  return 0;
}


Manual Instrumentation – F90 Example

cc34567 Cubes program - comment line
      PROGRAM SUM_OF_CUBES
      integer profiler(2)
      save profiler
      INTEGER :: H, T, U
      call TAU_PROFILE_INIT()
      call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
      call TAU_PROFILE_START(profiler)
      call TAU_PROFILE_SET_NODE(0)
      ! This program prints all 3-digit numbers that
      ! equal the sum of the cubes of their digits.
      DO H = 1, 9
        DO T = 0, 9
          DO U = 0, 9
            IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
              PRINT "(3I1)", H, T, U
            ENDIF
          END DO
        END DO
      END DO
      call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES


TAU’s MPI Wrapper Interposition Library

• Uses standard MPI Profiling Interface
    • Provides name shifted interface
        • MPI_Send = PMPI_Send
        • Weak bindings
• Interpose TAU’s MPI wrapper library between MPI and TAU
    • -lmpi replaced by -lTauMpi -lpmpi -lmpi
• No change to the source code! Just re-link the application to generate performance data
    • setenv TAU_MAKEFILE <dir>/<arch>/lib/Makefile.tau-mpi-[options]
    • Use tau_cxx.sh, tau_f90.sh and tau_cc.sh as compilers
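The name-shifted interface is what makes interposition possible without touching application source: a wrapper library defines the public name, records what it wants, and forwards to the shifted name. The sketch below imitates that pattern without requiring an MPI installation. `mock_MPI_Send`, `mock_PMPI_Send`, and the statistics struct are hypothetical stand-ins with simplified signatures, not the real `MPI_Send`/`PMPI_Send` prototypes or TAU's wrapper code.

```cpp
#include <cstddef>

// Stand-in for the library's name-shifted entry point (PMPI_Send in real MPI).
// The mock only counts the bytes it was asked to transmit.
static std::size_t g_bytes_on_wire = 0;
int mock_PMPI_Send(const void* /*buf*/, std::size_t bytes, int /*dest*/) {
    g_bytes_on_wire += bytes;
    return 0;
}

// Measurement state kept by the hypothetical wrapper library.
struct SendStats {
    long calls = 0;
    std::size_t bytes = 0;
};
static SendStats g_send_stats;

// Wrapper carrying the public name: record the call, then forward it
// unchanged to the name-shifted implementation.
int mock_MPI_Send(const void* buf, std::size_t bytes, int dest) {
    g_send_stats.calls += 1;
    g_send_stats.bytes += bytes;
    return mock_PMPI_Send(buf, bytes, dest);
}
```

The application keeps calling the public name. Which definition it reaches is decided at link time, which is why the slide's recipe only changes the library order on the link line.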


Using Program Database Toolkit (PDT)

1. Parse the program to create foo.pdb:

    % cxxparse foo.cpp -I/usr/local/mydir -DMYFLAGS …

   or

    % cparse foo.c -I/usr/local/mydir -DMYFLAGS …

   or

    % f95parse foo.f90 -I/usr/local/mydir …

    % f95parse *.f -omerged.pdb -I/usr/local/mydir -R free

2. Instrument the program:

    % tau_instrumentor foo.pdb foo.f90 -o foo.inst.f90 -f select.tau

3. Compile the instrumented program:

    % ifort foo.inst.f90 -c -I/usr/local/mpi/include -o foo.o


Using TAU

Step 1: Configure and install TAU:

% configure -pdt=<dir> -mpiinc=<dir> -mpilib=<dir> -c++=icpc -cc=icc -fortran=intel

% make clean; make install

Builds <taudir>/<arch>/lib/Makefile.tau-<options>

% set path=($path <taudir>/<arch>/bin)

Step 2: Choose target stub Makefile

% setenv TAU_MAKEFILE /san/cca/tau/tau-2.14.8/x86_64/lib/Makefile.tau-icpc-mpi-pdt

% setenv TAU_OPTIONS '-optVerbose -optKeepFiles'

(see tau_compiler.sh for all options)

Step 3: Use tau_f90.sh, tau_cxx.sh and tau_cc.sh as the F90, C++ or C compilers respectively.

% tau_f90.sh -c app.f90

% tau_f90.sh app.o -o app -lm -lblas

Or use these in the application Makefile.


AutoInstrumentation using TAU_COMPILER

• $(TAU_COMPILER) stub Makefile variable in 2.14+ release
• Invokes PDT parser, TAU instrumentor, compiler through tau_compiler.sh shell script
• Requires minimal changes to application Makefile
    • Compilation rules are not changed
    • User sets TAU_MAKEFILE and TAU_OPTIONS environment variables
    • User renames the compilers
        • F90=xlf90 to F90=tau_f90.sh
• Passes options from TAU stub Makefile to the four compilation stages
• Uses original compilation command if an error occurs


Tau_[cxx,cc,f90].sh – Improves Integration in Makefiles

OLD

include /usr/tau-2.14/include/Makefile
CXX = mpCC
F90 = mpxlf90_r
PDTPARSE = $(PDTDIR)/$(PDTARCHDIR)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS) $(TAU_INCLUDE)
LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS) -lm
OBJS = f1.o f2.o f3.o … fn.o

app: $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.i.cpp -f select.dat
	$(CC) $(CFLAGS) -c $*.i.cpp

NEW

# set TAU_MAKEFILE and TAU_OPTIONS env vars
CXX = tau_cxx.sh
F90 = tau_f90.sh
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o … fn.o

app: $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(CC) $(CFLAGS) -c $<


Using Stub Makefile and TAU_COMPILER

include /usr/common/acts/TAU/tau-2.14.8/rs6000/lib/Makefile.tau-mpi-pdt-trace
MYOPTIONS = -optVerbose -optKeepFiles
F90 = $(TAU_COMPILER) $(MYOPTIONS) mpxlf90
OBJS = f1.o f2.o f3.o …
LIBS = -Lappdir -lapplib1 -lapplib2 …

app: $(OBJS)
	$(F90) $(OBJS) -o app $(LIBS)

.f90.o:
	$(F90) -c $<


TAU_COMPILER Options

Optional parameters for $(TAU_COMPILER): [tau_compiler.sh -help]

    -optVerbose             Turn on verbose debugging messages
    -optPdtDir=""           PDT architecture directory. Typically $(PDTDIR)/$(PDTARCHDIR)
    -optPdtF95Opts=""       Options for Fortran parser in PDT (f95parse)
    -optPdtCOpts=""         Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optPdtCxxOpts=""       Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optPdtF90Parser=""     Specify a different Fortran parser, e.g., f90parse instead of f95parse
    -optPdtUser=""          Optional arguments for parsing source code
    -optPDBFile=""          Specify [merged] PDB file. Skips parsing phase.
    -optTauInstr=""         Specify location of tau_instrumentor. Typically $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
    -optTauSelectFile=""    Specify selective instrumentation file for tau_instrumentor
    -optTau=""              Specify options for tau_instrumentor
    -optCompile=""          Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optLinking=""          Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
    -optNoMpi               Removes -l*mpi* libraries during linking (default)
    -optKeepFiles           Does not remove intermediate .pdb and .inst.* files

e.g.,

    % setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose -optPdtCOpts="-I/home -DFOO"'
    % tau_cxx.sh matrix.cpp -o matrix -lm


Instrumentation Specification

% tau_instrumentor

Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file> ]

For selective instrumentation, use -f option

% tau_instrumentor foo.pdb foo.cpp -o foo.inst.cpp -f selective.dat

% cat selective.dat

# Selective instrumentation: Specify an exclude/include list of routines/files.

BEGIN_EXCLUDE_LIST

void quicksort(int *, int, int)

void sort_5elements(int *)

void interchange(int *, int *)

END_EXCLUDE_LIST

BEGIN_FILE_INCLUDE_LIST

Main.cpp

Foo?.c

*.C

END_FILE_INCLUDE_LIST

# Instruments routines in Main.cpp, Foo?.c and *.C files only

# Use BEGIN_[FILE]_INCLUDE_LIST with END_[FILE]_INCLUDE_LIST
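The file include list above uses wildcard patterns (`Foo?.c`, `*.C`). As a rough illustration of how such a list could be applied, the sketch below matches file names against patterns where `?` stands for exactly one character and `*` for any run of characters. That wildcard meaning is an assumption based on the usual shell convention; the functions `glob_match` and `file_included` are hypothetical and do not reproduce tau_instrumentor's matching code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Recursive glob match: '?' matches one character, '*' matches any run
// (including the empty one). Assumed semantics, see the lead-in.
bool glob_match(const std::string& pat, const std::string& text,
                std::size_t p = 0, std::size_t t = 0) {
    if (p == pat.size()) return t == text.size();
    if (pat[p] == '*') {
        // Try every possible length for the run matched by '*'.
        for (std::size_t k = t; k <= text.size(); ++k)
            if (glob_match(pat, text, p + 1, k)) return true;
        return false;
    }
    if (t == text.size()) return false;
    if (pat[p] != '?' && pat[p] != text[t]) return false;
    return glob_match(pat, text, p + 1, t + 1);
}

// A file is selected if it matches any entry of the file include list.
bool file_included(const std::vector<std::string>& include_list,
                   const std::string& file) {
    for (const std::string& pat : include_list)
        if (glob_match(pat, file)) return true;
    return false;
}
```

With the list `Main.cpp`, `Foo?.c`, `*.C`, the names `Main.cpp`, `Foo1.c`, and `solver.C` are selected, while `Foo12.c` and `util.cpp` are not.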


tau_reduce: Rule-Based Overhead Analysis

• Analyze the performance data to determine events with high (relative) overhead performance measurements
• Create a select list for excluding those events
• Rule grammar (used in tau_reduce tool)
    [GroupName:] Field Operator Number
    • GroupName indicates rule applies to events in group
    • Field is an event metric attribute (from profile statistics)
        • numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
    • Operator is one of >, <, or =
    • Number is any number
    • Compound rules possible using & between simple rules


Optimization of Program Instrumentation

• Need to eliminate instrumentation in frequently executing lightweight routines
• Throttling of events at runtime:
    % setenv TAU_THROTTLE 1
    • Turns off instrumentation in routines that execute over 10000 times (TAU_THROTTLE_NUMCALLS) and take less than 10 microseconds of inclusive time per call (TAU_THROTTLE_PERCALL)
• Selective instrumentation file to filter events
    % tau_instrumentor [options] -f <file>
• Compensation of local instrumentation overhead
    % configure -COMPENSATE
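The throttling condition above combines two thresholds, and both must hold before a routine's instrumentation is switched off. A minimal sketch of that decision follows; the thresholds are the defaults quoted on the slide, while the `ThrottlePolicy` struct and `should_throttle` function are hypothetical names introduced for illustration.

```cpp
// Thresholds quoted on the slide, named after the environment variables.
struct ThrottlePolicy {
    long numcalls = 10000;        // TAU_THROTTLE_NUMCALLS
    double percall_usec = 10.0;   // TAU_THROTTLE_PERCALL
};

// True when a routine is both called often and cheap per call, i.e. when
// measurement overhead is likely to dominate what is being measured.
bool should_throttle(long calls, double inclusive_usec,
                     const ThrottlePolicy& policy = ThrottlePolicy()) {
    if (calls <= policy.numcalls) return false;
    return inclusive_usec / calls < policy.percall_usec;
}
```

A routine called 20000 times for a total of 100000 microseconds (5 per call) would be throttled. The same call count with 400000 microseconds (20 per call) would not, and neither would a cheap routine called only 5000 times.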


TAU_REDUCE

• Reads profile files and rules
• Creates selective instrumentation file
    • Specifies which routines should be excluded from instrumentation

[Figure: tau_reduce takes rules and a profile as input and produces a selective instrumentation file.]


Optimizing Instrumentation Overhead: Rules

# Exclude all events that are members of TAU_USER
# and use less than 1000 microseconds
TAU_USER:usec < 1000

# Exclude all events that have less than 100
# microseconds and are called only once
usec < 1000 & numcalls = 1

# Exclude all events that have less than 1000 usecs per
# call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5

• Scientific notation can be used
    usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
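To show how rules of the form `[GroupName:] Field Operator Number` could be evaluated against one event's profile statistics, here is a small sketch. It is an illustration under assumptions, not tau_reduce's parser: the `EventStats` struct and function names are hypothetical, a group prefix is only recognized on the first simple rule, and separate rule lines (the OR case) would be handled by calling `rule_matches` once per line.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Statistics for one event, keyed by the field names used in the rules.
struct EventStats {
    std::string group;                     // e.g. "TAU_USER"
    std::map<std::string, double> fields;  // e.g. {"usec", 250}, {"numcalls", 1}
};

static std::string trim(const std::string& s) {
    const char* ws = " \t";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Evaluates one simple rule: [GroupName:] Field Operator Number
static bool simple_rule_matches(std::string rule, const EventStats& ev) {
    std::size_t colon = rule.find(':');
    if (colon != std::string::npos) {
        if (trim(rule.substr(0, colon)) != ev.group) return false;
        rule = rule.substr(colon + 1);
    }
    std::size_t op = rule.find_first_of("<>=");
    if (op == std::string::npos) return false;
    auto it = ev.fields.find(trim(rule.substr(0, op)));
    if (it == ev.fields.end()) return false;
    double number = std::stod(rule.substr(op + 1));  // also parses 4e5
    switch (rule[op]) {
        case '<': return it->second < number;
        case '>': return it->second > number;
        default:  return it->second == number;
    }
}

// A compound rule is a conjunction of simple rules joined by '&'.
bool rule_matches(const std::string& rule, const EventStats& ev) {
    std::size_t start = 0;
    while (true) {
        std::size_t amp = rule.find('&', start);
        std::size_t len = (amp == std::string::npos) ? amp : amp - start;
        if (!simple_rule_matches(rule.substr(start, len), ev)) return false;
        if (amp == std::string::npos) return true;
        start = amp + 1;
    }
}
```

An event in group TAU_USER with usec = 250 and numcalls = 1 matches both `TAU_USER:usec < 1000` and `usec < 1000 & numcalls = 1`, so it would land on the exclude list.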


Instrumentation of OpenMP Constructs

• OPARI: OpenMP Pragma And Region Instrumentor
• Source-to-source translator to insert POMP calls around OpenMP constructs and API functions
• Done: Supports
    • Fortran77 and Fortran90, OpenMP 2.0
    • C and C++, OpenMP 1.0
    • POMP Extensions
    • EPILOG and TAU POMP implementations
    • Preserves source code information (#line line file)
• Work in Progress: Investigating standardization through OpenMP Forum
• KOJAK Project website: http://icl.cs.utk.edu/kojak


OpenMP API Instrumentation

• Transform
    • omp_#_lock() → pomp_#_lock()
    • omp_#_nest_lock() → pomp_#_nest_lock()
    [ # = init | destroy | set | unset | test ]
• POMP version
    • Calls omp version internally
    • Can do extra stuff before and after call


Example: !$OMP PARALLEL DO Instrumentation

Original directive:

    !$OMP PARALLEL DO clauses...
          do loop
    !$OMP END PARALLEL DO

After instrumentation, the combined directive is split and POMP calls are inserted:

          call pomp_parallel_fork(d)
    !$OMP PARALLEL other-clauses...
          call pomp_parallel_begin(d)
          call pomp_do_enter(d)
    !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
          do loop
    !$OMP END DO NOWAIT
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_do_exit(d)
          call pomp_parallel_end(d)
    !$OMP END PARALLEL
          call pomp_parallel_join(d)


Opari Instrumentation: Example

OpenMP directive instrumentation

    pomp_for_enter(&omp_rd_2);
    #line 252 "stommel.c"
    #pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
    for( i=i1;i<=i2;i++) {
      for(j=j1;j<=j2;j++){
        new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                      + a4*psi[i][j-1] - a5*the_for[i][j];
        diff=diff+fabs(new_psi[i][j]-psi[i][j]);
      }
    }
    pomp_barrier_enter(&omp_rd_2);
    #pragma omp barrier
    pomp_barrier_exit(&omp_rd_2);
    pomp_for_exit(&omp_rd_2);
    #line 261 "stommel.c"


Using Opari with TAU

Step I: Configure KOJAK/opari [Download from http://www.fz-juelich.de/zam/kojak/]

% cd kojak-2.1; cp mf/Makefile.defs.ibm Makefile.defs; edit Makefile

% make

Builds opari

Step II: Configure TAU with Opari (used here with MPI and PDT)

% configure -opari=/usr/contrib/TAU/kojak-2.1/opari -mpiinc=/usr/lpp/ppe.poe/include -mpilib=/usr/lpp/ppe.poe/lib -pdt=/usr/contrib/TAU/pdtoolkit-3.4

% make clean; make install

% setenv TAU_MAKEFILE /tau/<arch>/lib/Makefile.tau-…opari-…

% tau_cxx.sh -c foo.cpp

% tau_f90.sh -c bar.f90

% tau_cxx.sh *.o -o app


Work in Progress

• Eclipse PTP
    • Integration of TAU in Eclipse IDE
    • TAU’s Java Plugin
    • TAU’s Fortran 95/C++/C Plugin
• Statement and loop level automatic instrumentation
• Memory tracking extensions
• ParaProf
    • Time-series profile display windows
    • Trace to profile (tau2profile) conversion tool generates profile snapshots periodically
• Online performance monitoring extensions
• KTAU: Kernel performance monitoring package [ZeptoOS, ANL]


Building Bridges to Other Tools: TAU


TAU Performance System Interfaces

• PDT [U. Oregon, LANL, FZJ] for instrumentation of C++, C99, F95 source code
• PAPI [UTK] & PCL [FZJ] for accessing hardware performance counters data
• DyninstAPI [U. Maryland, U. Wisconsin] for runtime instrumentation
• KOJAK [FZJ, UTK]
    • Epilog trace generation library
    • CUBE callgraph visualizer
    • Opari OpenMP directive rewriting tool
• Vampir/Intel® Trace Analyzer [Pallas/Intel]
• VTF3 trace generation library for Vampir [TU Dresden] (available from TAU website)
• Paraver trace visualizer [CEPBA]
• Jumpshot-4 trace visualizer [MPICH, ANL]
• JVMPI from JDK for Java program instrumentation [Sun]
• Paraprof profile browser/PerfDMF database supports:
    • TAU format
    • Gprof [GNU]
    • HPM Toolkit [IBM]
    • MpiP [ORNL, LLNL]
    • Dynaprof [UTK]
    • PSRun [NCSA]
• PerfDMF database can use Oracle, MySQL or PostgreSQL (IBM DB2 support planned)


PAPI [UTK]

Performance Application Programming Interface
  The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Parallel Tools Consortium project
University of Tennessee, Knoxville
http://icl.cs.utk.edu/papi


Memory Profiling in TAU

Configuration option -PROFILEMEMORY
  Records global heap memory utilization for each function
  Takes one sample at beginning of each function and associates the sample with function name
Configuration option -PROFILEHEADROOM
  Records headroom (amount of free memory to grow) for each function
  Takes one sample at beginning of each function and associates it with the callstack [TAU_CALLPATH_DEPTH env variable]
  Useful for debugging memory usage on IBM BG/L.
Independent of instrumentation/measurement options selected
No need to insert macros/calls in the source code
User defined atomic events appear in profiles/traces
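The idea of taking one heap sample at the beginning of each function and filing it under the function's name can be sketched in a few lines of Python. This is only an analogy, not TAU's implementation: it uses the standard `tracemalloc` module as the heap probe, and the names `track_heap`, `heap_samples`, `build_table` and `use_table` are hypothetical.

```python
import functools
import tracemalloc
from collections import defaultdict

# Hypothetical store: function name -> list of heap samples in bytes.
heap_samples = defaultdict(list)

def track_heap(func):
    """Take one heap sample on entry to func and file it under func's name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        current, _peak = tracemalloc.get_traced_memory()
        heap_samples[func.__name__].append(current)
        return func(*args, **kwargs)
    return wrapper

@track_heap
def build_table(n):
    # Allocates a large list that stays alive after the call returns.
    return [0] * n

@track_heap
def use_table(table):
    return len(table)

tracemalloc.start()
table = build_table(100_000)
size = use_table(table)   # sampled while the table is still allocated
tracemalloc.stop()
```

The sample recorded on entry to `use_table` is larger than the one recorded on entry to `build_table`, because the table is live by then. Each sample plays the role of a user defined atomic event attached to a routine.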


Memory Profiling in TAU

Flash2 code profile (-PROFILEMEMORY) on IBM BlueGene/L [MPI rank 0]


Memory Profiling in TAU

Instrumentation based observation of global heap memory (not per function)

call TAU_TRACK_MEMORY()
call TAU_TRACK_MEMORY_HEADROOM()
  Triggers one sample every 10 secs
call TAU_TRACK_MEMORY_HERE()
call TAU_TRACK_MEMORY_HEADROOM_HERE()
  Triggers sample at a specific location in source code
call TAU_SET_INTERRUPT_INTERVAL(seconds)
  To set inter-interrupt interval for sampling
call TAU_DISABLE_TRACKING_MEMORY()
call TAU_DISABLE_TRACKING_MEMORY_HEADROOM()
  To turn off recording memory utilization
call TAU_ENABLE_TRACKING_MEMORY()
call TAU_ENABLE_TRACKING_MEMORY_HEADROOM()
  To re-enable tracking memory utilization


Using TAU’s Malloc Wrapper Library for C/C++

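include /usr/local/tools/tau/i386_linux/lib/Makefile.tau-pdt
CC = $(TAU_CC)
# TAU_MEMORY_INCLUDE adds the include path for TAU's malloc wrapper
CFLAGS = $(TAU_DEFS) $(TAU_INCLUDE) $(TAU_MEMORY_INCLUDE)
LIBS = $(TAU_LIBS)
OBJS = f1.o f2.o ...
TARGET = a.out

# Link the instrumented objects with the TAU libraries
$(TARGET): $(OBJS)
	$(CC) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

# Compile each C source with the TAU flags
.c.o:
	$(CC) $(CFLAGS) -c $< -o $@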


TAU’s malloc/free wrapper for C/C++

#include <TAU.h>
#include <malloc.h>

int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);

  int *ary = (int *) malloc(sizeof(int) * 4096);
  // TAU's malloc wrapper library replaces this call automatically
  // when $(TAU_MEMORY_INCLUDE) is used in the Makefile.

  free(ary);
  // other statements in main …

  return 0;
}


Using TAU’s Malloc Wrapper Library for C/C++


Dynamic Instrumentation

TAU uses DyninstAPI for runtime code patching
  Developed by U. Wisconsin and U. Maryland
  http://www.dyninst.org
tau_run (mutator) loads measurement library
Instruments mutatee
MPI issues:
  one mutator per executable image [TAU, DynaProf]
  one mutator for several executables [Paradyn, DPCL]


Using DyninstAPI with TAU

Step I: Install DyninstAPI [Download from http://www.dyninst.org]

% cd dyninstAPI-4.1/core; make

Set DyninstAPI environment variables (including LD_LIBRARY_PATH)

Step II: Configure TAU with Dyninst

% configure -dyninst=/usr/local/dyninstAPI-4.1

% make clean; make install

Builds <taudir>/<arch>/bin/tau_run

% tau_run [<-o outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>

% tau_run -o a.inst.out a.out

Rewrites a.out as the instrumented binary a.inst.out

% tau_run klargest

Instruments klargest with TAU calls and executes it

% tau_run -XrunTAUsh-papi a.out

Loads libTAUsh-papi.so instead of libTAU.so for measurements


Virtual Machine Performance Instrumentation

Integrate performance system with VM
  Captures robust performance data (e.g., thread events)
  Maintain features of environment
    portability, concurrency, extensibility, interoperation
  Allow use in optimization methods
JVM Profiling Interface (JVMPI)
  Generation of JVM events and hooks into JVM
  Profiler agent (TAU) loaded as shared object
    registers events of interest and address of callback routine
  Access to information on dynamically loaded classes
  No need to modify Java source, bytecode, or JVM


Using TAU with Java Applications

Step I: Sun JDK 1.4+ [download from www.javasoft.com]

Step II: Configure TAU with JDK (v 1.2 or better)

% configure -jdk=/usr/java2 -TRACE -PROFILE

% make clean; make install

Builds <taudir>/<arch>/lib/libTAU.so

For Java (without instrumentation):

% java application

With instrumentation:

% java -XrunTAU application

% java -XrunTAU:exclude=sun/io,java application

Excludes sun/io/* and java/* classes


TAU Profiling of Java Application (SciVis)

Profile for each Java thread
Captures events for different Java packages

24 threads of execution!

Global routine profile


Using TAU with Python Applications

Step I: Configure TAU with Python

% configure -pythoninc=/usr/include/python2.2/include

% make clean; make install

Builds <taudir>/<arch>/lib/<bindings>/pytau.py and tau.py packages for manual and automatic instrumentation respectively

% setenv PYTHONPATH $PYTHONPATH\:<taudir>/<arch>/lib/[<dir>]


Python Automatic Instrumentation Example

#!/usr/bin/env python
import tau
from time import sleep

def f2():
    print " In f2: Sleeping for 2 seconds "
    sleep(2)

def f1():
    print " In f1: Sleeping for 3 seconds "
    sleep(3)

def OurMain():
    f1()

tau.run('OurMain()')

Running:

% setenv PYTHONPATH <tau>/<arch>/lib

% ./auto.py

Instruments OurMain, f1, f2, print…
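How an interpreter-level hook can instrument every Python routine without source changes can be sketched with the standard `sys.setprofile` callback. This is an illustration under stated assumptions, not the code behind `tau.run`: the helper `run_instrumented` and the tables `call_counts` and `inclusive_secs` are hypothetical names, and here `f1` calls `f2` so that nesting is visible.

```python
import sys
import time
from collections import defaultdict

call_counts = defaultdict(int)       # routine name -> number of calls
inclusive_secs = defaultdict(float)  # routine name -> inclusive wall time
_entry_times = []                    # stack of entry timestamps

def _hook(frame, event, arg):
    """Profile callback: the interpreter invokes it on routine entry/exit."""
    if event == "call":
        _entry_times.append(time.perf_counter())
    elif event == "return" and _entry_times:
        name = frame.f_code.co_name
        call_counts[name] += 1
        inclusive_secs[name] += time.perf_counter() - _entry_times.pop()
    # C-level events (c_call, c_return, c_exception) are ignored here.

def run_instrumented(func):
    """Run func with the hook installed, then remove the hook."""
    sys.setprofile(_hook)
    try:
        func()
    finally:
        sys.setprofile(None)

def f2():
    time.sleep(0.01)

def f1():
    time.sleep(0.02)
    f2()

def our_main():
    f1()

run_instrumented(our_main)
```

After the run, `call_counts` holds one call each for `our_main`, `f1` and `f2`, and the inclusive time of a caller is never smaller than that of its callee.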


Performance Mapping

Associate performance with “significant” entities (events)

Source code points are important
  Functions, regions, control flow events, user events
Execution process and thread entities are important
Some entities are more abstract, harder to measure


TAU: An Overview

Instrumentation Measurement Analysis


Performance Mapping in Callpath Profiling

Consider callgraph (callpath) profiling
  Measure time (metric) along an edge (path) of callgraph
    Incident edge gives parent / child view
    Edge sequence (path) gives parent / descendant view
Callpath profiling when callgraph is unknown
  Must determine callgraph dynamically at runtime
  Map performance measurement to dynamic call path state
Callpath levels
  1-level: current callgraph node/flat profile
  2-level: immediate parent (descendant)
  k-level: kth nodes in the calling path


k-Level Callpath Implementation in TAU

TAU maintains a performance event (routine) callstack
Profiled routine (child) looks in callstack for parent
  Previous profiled performance event is the parent
  A callpath profile structure created first time parent calls
  TAU records parent in a callgraph map for child
  String representing k-level callpath used as its key
    “a( )=>b( )=>c()” : name for time spent in “c” when called by “b” when “b” is called by “a”
Map returns pointer to callpath profile structure
  k-level callpath is profiled using this profiling data
  Set environment variable TAU_CALLPATH_DEPTH to depth
Build upon TAU’s performance mapping technology
Measurement is independent of instrumentation
Use -PROFILECALLPATH to configure TAU
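The callstack-plus-map scheme described above can be sketched as follows. This is a minimal Python model, not TAU's data structures: the class `CallpathProfiler`, its methods, and the use of explicit timestamps (so the example is deterministic) are all assumptions made for illustration. The `depth` argument plays the role of TAU_CALLPATH_DEPTH.

```python
from collections import defaultdict

class CallpathProfiler:
    """Toy k-level callpath profiler keyed by 'a => b => c' strings."""

    def __init__(self, depth):
        self.depth = depth   # k: how many callstack entries form the key
        self.stack = []      # callstack of (routine name, entry timestamp)
        self.profiles = defaultdict(lambda: {"calls": 0, "inclusive": 0.0})

    def _key(self):
        # The last k names on the callstack, oldest first.
        names = [name for name, _ in self.stack[-self.depth:]]
        return " => ".join(names)

    def start(self, name, now):
        self.stack.append((name, now))

    def stop(self, now):
        key = self._key()                 # look up the callpath in the map
        _name, started = self.stack.pop()
        entry = self.profiles[key]        # created on first use
        entry["calls"] += 1
        entry["inclusive"] += now - started

# a calls b, b calls c; timestamps are supplied by hand.
prof = CallpathProfiler(depth=3)
prof.start("a", 0)
prof.start("b", 1)
prof.start("c", 2)
prof.stop(5)    # c ran from 2 to 5
prof.stop(7)    # b ran from 1 to 7
prof.stop(10)   # a ran from 0 to 10
```

With `depth=3` the map holds the keys `a`, `a => b` and `a => b => c`. Lowering the depth to 2 would record the innermost event as `b => c` instead.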


k-Level Callpath Implementation in TAU


Gprof Style Callpath View in Paraprof


Profile Measurement – Three Flavors

Flat profiles
  Time (or counts) spent in each routine (nodes in callgraph).
  Exclusive/inclusive time, no. of calls, child calls
  E.g., MPI_Send, foo, …
Callpath Profiles
  Flat profiles, plus
  Sequence of actions that led to poor performance
  Time spent along a calling path (edges in callgraph)
  E.g., “main => f1 => f2 => MPI_Send” shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when it is called by main. Depth of this callpath = 4 (TAU_CALLPATH_DEPTH environment variable)
Phase based profiles
  Flat profiles, plus
  Flat profiles under a phase (nested phases are allowed)
  Default “main” phase has all phases and routines invoked outside phases
  Supports static or dynamic (per-iteration) phases
  E.g., “IO => MPI_Send” is time spent in MPI_Send in IO phase


TAU Timers and Phases

Static timer
  Shows time spent in all invocations of a routine (foo)
  E.g., “foo()” 100 secs, 100 calls
Dynamic timer
  Shows time spent in each invocation of a routine
  E.g., “foo() 3” 4.5 secs, “foo() 10” 2 secs (invocations 3 and 10 respectively)
Static phase
  Shows time spent in all routines called (directly/indirectly) by a given routine (foo)
  E.g., “foo() => MPI_Send()” 100 secs, 10 calls shows that a total of 100 secs were spent in MPI_Send() when it was called by foo.
Dynamic phase
  Shows time spent in all routines called by a given invocation of a routine.
  E.g., “foo() 4 => MPI_Send()” 12 secs, shows that 12 secs were spent in MPI_Send when it was called by the 4th invocation of foo.
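The difference between static and dynamic naming can be made concrete with a small Python sketch. It only models how event names are formed and how times accumulate under them; the helpers `timer_name`, `phase_event_name` and `accumulate` and the sample values are hypothetical, not TAU calls or measured data.

```python
from collections import defaultdict

def timer_name(routine, invocation, dynamic):
    """Static timers share one name; dynamic timers append the invocation."""
    return f"{routine} {invocation}" if dynamic else routine

def phase_event_name(phase, invocation, routine, dynamic):
    """Name of a routine's event when it runs under a phase."""
    return f"{timer_name(phase, invocation, dynamic)} => {routine}"

def accumulate(samples, dynamic):
    """samples: iterable of (phase, invocation, routine, seconds)."""
    totals = defaultdict(float)
    for phase, invocation, routine, secs in samples:
        totals[phase_event_name(phase, invocation, routine, dynamic)] += secs
    return dict(totals)

# Made-up measurements: MPI_Send() called from invocations 3 and 4 of foo().
samples = [
    ("foo()", 3, "MPI_Send()", 4.5),
    ("foo()", 4, "MPI_Send()", 12.0),
]
static_view = accumulate(samples, dynamic=False)
dynamic_view = accumulate(samples, dynamic=True)
```

The static view collapses both samples into a single `foo() => MPI_Send()` entry, while the dynamic view keeps `foo() 3 => MPI_Send()` and `foo() 4 => MPI_Send()` apart, which is what makes per-iteration comparisons possible.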


Static Timers in TAU

SUBROUTINE SUM_OF_CUBES
  integer profiler(2)
  save profiler
  INTEGER :: H, T, U
  call TAU_PROFILE_TIMER(profiler, 'SUM_OF_CUBES')
  call TAU_PROFILE_START(profiler)
  ! This program prints all 3-digit numbers that
  ! equal the sum of the cubes of their digits.
  DO H = 1, 9
    DO T = 0, 9
      DO U = 0, 9
        IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
          PRINT "(3I1)", H, T, U
        ENDIF
      END DO
    END DO
  END DO
  call TAU_PROFILE_STOP(profiler)
END SUBROUTINE SUM_OF_CUBES


Static Phases and Timers

SUBROUTINE FOO
  integer profiler(2)
  save profiler
  call TAU_PHASE_CREATE_STATIC(profiler, 'foo')
  call TAU_PHASE_START(profiler)
  call bar()
  ! Here bar calls MPI_Barrier and we evaluate foo=>MPI_Barrier and foo=>bar
  call TAU_PHASE_STOP(profiler)
END SUBROUTINE FOO

SUBROUTINE BAR
  integer profiler(2)
  save profiler
  call TAU_PROFILE_TIMER(profiler, 'bar')
  call TAU_PROFILE_START(profiler)
  call MPI_Barrier()
  call TAU_PROFILE_STOP(profiler)
END SUBROUTINE BAR


Dynamic Phases

SUBROUTINE ITERATE(IER, NIT)
  IMPLICIT NONE
  INTEGER IER, NIT
  character(11) taucharary
  integer tauiteration / 0 /
  integer profiler(2) / 0, 0 /
  save profiler, tauiteration
  write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
  ! taucharary is the name of the phase, e.g., 'ITERATE  23'
  tauiteration = tauiteration + 1
  call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
  call TAU_PHASE_START(profiler)
  IER = 0
  call SOLVE_K_EPSILON_EQ(IER)
  ! Other work
  call TAU_PHASE_STOP(profiler)
END SUBROUTINE ITERATE


TAU’s ParaProf Profile Browser: Static Timers


Dynamic Timers


Static Phases

MPI_Barrier took 4.85 secs out of 13.48 secs in the DTM Phase


Dynamic Phases

The first iteration was expensive for INT_RTE. It took 27.89 secs. Other iterations took less time: 14.2, 10.5, 10.3, 10.5 seconds


Dynamic Phases

Time spent in MPI_Barrier, MPI_Recv,… in DTM ITERATION 1

Breakdown of time spent in MPI_Isend based on its static and dynamic parent phases


Advances in TAU Performance Analysis

Enhanced parallel profile analysis (ParaProf)
  Callpath analysis integration in ParaProf
  Event callgraph view
Performance Data Management Framework (PerfDMF)
  First release of prototype
Integration with Vampir Next Generation (VNG)
  Online trace analysis
3D Performance visualization
Component performance modeling and QoS


Pprof – Flat Profile (NAS PB LU)

Intel Linux cluster

F90 + MPICH

Profile: node, context, thread

Events: code, MPI

Metric: time

Text display


Terminology – Example

For routine “int main( )”:
  Exclusive time: 100-20-50-20 = 10 secs
  Inclusive time: 100 secs
  Calls: 1 call
  Subrs (no. of child routines called): 3
  Inclusive time/call: 100 secs

int main( )
{ /* takes 100 secs */
  f1(); /* takes 20 secs */
  f2(); /* takes 50 secs */
  f1(); /* takes 20 secs */
  /* other work */
}

/* Time can be replaced by counts from PAPI e.g., PAPI_FP_INS. */
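The arithmetic behind these terms can be written out as a short Python sketch. The function name `exclusive_time` and the variable names are hypothetical; the numbers are the ones from the example above.

```python
def exclusive_time(inclusive, child_inclusive_times):
    """Exclusive time = inclusive time minus time spent in child calls."""
    return inclusive - sum(child_inclusive_times)

main_inclusive = 100            # secs spent in main, children included
child_calls = [20, 50, 20]      # f1(), f2(), f1() as called from main
main_calls = 1

main_exclusive = exclusive_time(main_inclusive, child_calls)
subrs = len(child_calls)                       # child routines called
inclusive_per_call = main_inclusive / main_calls
```

The same subtraction applies when the metric is a hardware counter value instead of time.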


ParaProf – Manager Window

performance database

derived performance metrics


Performance Database: Storage of MetaData


ParaProf – Full Profile (Miranda)

8K processors!


ParaProf– Flat Profile (Miranda)


ParaProf– Callpath Profile (Flash)


ParaProf– Callpath Profile (ESMF)

21-level callpath


Gprof Style Callpath View in Paraprof (SAGE)


ParaProf – Phase Profile (MFIX)

In 51st iteration, time spent in MPI_Waitall was 85.81 secs

Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations

dynamic phases, one per iteration


ParaProf - Statistics Table (Uintah)


ParaProf – Histogram View (Miranda)

8k processors 16k processors

Scalable 2D displays


ParaProf –Callgraph View (MFIX)


ParaProf – Callpath Highlighting (Flash)

MODULEHYDRO_1D:HYDRO_1D


Profiling of Miranda on BG/L

128 Nodes 512 Nodes 1024 Nodes

Profile code performance (automatic instrumentation) [Brian Miller, CASC, LLNL]

Scaling studies (problem size, number of processors)

Run on 8K, 16K and 32K processors!


ParaProf – 3D Full Profile (Miranda)

16k processors


ParaProf Bar Plot (Zoom in/out +/-)


ParaProf – 3D Scatterplot (Miranda)

Each point is a “thread” of execution

A total of four metrics shown in relation

ParaVis 3D profile visualization library, JOGL


Vampir, VNG, and OTF

Commercial trace based tools developed at ZiH, T.U. Dresden
  Wolfgang Nagel, Holger Brunst and others…
Vampir Trace Visualizer (aka Intel® Trace Analyzer v4.0)
  Sequential program
Vampir Next Generation (VNG)
  Client (vng) runs on a desktop, server (vngd) on a cluster
  Parallel trace analysis
  Orders of magnitude bigger traces (more memory)
  State of the art in parallel trace visualization
Open Trace Format (OTF)
  Hierarchical trace format, efficient streams based parallel access with VNGD
  Replacement for proprietary formats such as STF
  Tracing library available on IBM BG/L platform
Development of OTF supported by LLNL contract

http://www.vampir-ng.de


Vampir Next Generation (VNG) Architecture

[Figure: VNG architecture diagram. Labels: Parallel Program, Monitor System, Event Streams, File System (Trace 1, Trace 2, Trace 3, … Trace N), Analysis Server (Master, Worker 1, Worker 2, … Worker m; Process, Parallel I/O, Message Passing), Internet, Visualization Client (Segment Indicator, 768 Processes Thumbnail, Timeline with 16 visible Traces). Classic Analysis: monolithic, sequential, Merged Traces.]


VNG Parallel Analysis Server

[Figure: VNG parallel analysis server diagram. Labels: Master (Session Thread, Analysis Merger, Endian Conversion, Message Passing, Socket Communication), Visualization Client, Worker 1, Worker 2, … Worker m (Session Thread, Analysis Module, Event Databases, Message Passing, Trace Format Driver), Traces, M Worker, N Session Threads.]


Scalability of VNG

sPPM 16 CPUs 200 MB

[Figure: Speedup (0 to 18) vs. Number of Workers (0 to 40) for Com. Matrix, Timeline, Summary Profile, Process Profile, Stack Tree, and Load Time]

Number of Workers      1      2      4     8    16    32
Load Time          47.33  22.48  10.80  5.43  3.01  3.16
Timeline            0.10   0.09   0.06  0.08  0.09  0.09
Summary Profile     1.59   0.87   0.47  0.30  0.28  0.25
Process Profile     1.32   0.70   0.38  0.26  0.17  0.17
Com. Matrix         0.06   0.07   0.08  0.09  0.09  0.09
Stack Tree          2.57   1.39   0.70  0.44  0.25  0.25
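The speedup plotted on this slide can be recomputed from the tabulated values. The Python sketch below assumes that speedup is defined relative to the single-worker measurement, which the slide does not state explicitly; the helper names `speedups` and `efficiency` are hypothetical, and only two rows of the table are used.

```python
workers = [1, 2, 4, 8, 16, 32]

# Values taken from the table, decimal commas written as points.
load_time = [47.33, 22.48, 10.80, 5.43, 3.01, 3.16]
timeline = [0.10, 0.09, 0.06, 0.08, 0.09, 0.09]

def speedups(times):
    """Speedup of each configuration relative to the first (1 worker)."""
    base = times[0]
    return [base / t for t in times]

def efficiency(times, worker_counts):
    """Parallel efficiency: speedup divided by the number of workers."""
    return [s / w for s, w in zip(speedups(times), worker_counts)]

load_speedup = speedups(load_time)
timeline_speedup = speedups(timeline)

# Worker count at which the load-time speedup peaks.
best_workers = workers[load_speedup.index(max(load_speedup))]
```

Under that assumption the load time scales well up to 16 workers and flattens at 32, while the timeline display shows almost no dependence on the number of workers.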


VNG Analysis Server Architecture

Implementation using MPI and Pthreads
Client/server approach
MPI and pthreads are available on most platforms
Workload and data distribution among “physical” MPI processes
Support of multiple visualization clients by using virtual sessions handled by individual threads
Sessions are scheduled as threads


TAU Tracing Enhancements

Configure TAU with -TRACE -vtf=<dir> -otf=<dir> options
  % configure -TRACE -vtf=<dir> …
  % configure -TRACE -otf=<dir> …
  Generates tau_merge, tau2vtf, tau2otf tools in <tau>/<arch>/bin directory
  % tau_f90.sh app.f90 -o app
Instrument and execute application
  % mpirun -np 4 app
Merge and convert trace files to VTF3/SLOG2 format
  % tau_treemerge.pl
  % tau2vtf tau.trc tau.edf app.vpt.gz
  % vampir app.vpt.gz
  OR
  % tau2otf tau.trc tau.edf app.otf -n <numstreams>
  % vampir app.otf
  OR use VNG to analyze OTF/VTF trace files
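What the merge step has to accomplish, conceptually, is to combine several per-process event streams into one stream ordered by time. The Python sketch below shows that idea with a k-way merge from the standard `heapq` module. It is an assumption-laden illustration, not the algorithm or file format used by tau_treemerge.pl: the event tuples, the function `merge_traces` and the sample data are all hypothetical.

```python
import heapq

def merge_traces(*per_process_events):
    """Merge event lists that are each already sorted by timestamp.

    Every event is a (timestamp, process_id, description) tuple.
    """
    return list(heapq.merge(*per_process_events, key=lambda ev: ev[0]))

# Made-up events from two processes, each list sorted by timestamp.
rank0 = [(0.0, 0, "enter main"), (2.5, 0, "MPI_Send"), (9.0, 0, "exit main")]
rank1 = [(0.1, 1, "enter main"), (2.6, 1, "MPI_Recv"), (8.7, 1, "exit main")]

merged = merge_traces(rank0, rank1)
timestamps = [ev[0] for ev in merged]
```

Because each input is consumed in order, the merge never needs to sort the full event set, which matters when the individual traces are large.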


Environment Variables

Configure TAU with -TRACE -otf=<dir> option
  % configure -TRACE -otf=<dir> -MULTIPLECOUNTERS -papi=<dir> -mpi -pdt=<dir> …
Set environment variables
  % setenv TRACEDIR /p/gm1/<login>/traces
  % setenv COUNTER1 GET_TIME_OF_DAY (reqd)
  % setenv COUNTER2 PAPI_FP_INS
  % setenv COUNTER3 PAPI_TOT_CYC …
Execute application
  % srun -N8 -n16 -p pdebug ./a.out [args]
tau_treemerge.pl and tau2otf/tau2vtf


Using Vampir Next Generation (VNG v1.4)


VNG Timeline Display


VNG Calltree Display


VNG Timeline Zoomed In


VNG Grouping of Interprocess Communications


VNG Process Timeline with PAPI Counters


OTF/VNG Support for Counters


VNG Communication Matrix Display


VNG Message Profile


VNG Process Activity Chart


VNG Preferences


TAU Performance System Status

Computing platforms (selected)
  IBM SP/pSeries/BGL, SGI Altix/Origin, Cray T3E/SV-1/X1/XT3, HP (Compaq) SC (Tru64), Sun, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Hitachi SR8000, NEC SX-5/6, Windows …
Programming languages
  C, C++, Fortran 77/90/95, HPF, Java, Python
Thread libraries (selected)
  pthreads, OpenMP, SGI sproc, Java, Windows, Charm++
Compilers (selected)
  Intel, PGI, GNU, Fujitsu, Sun, PathScale, SGI, Cray, IBM, HP, NEC, Absoft, Lahey, Nagware


Project Affiliations (selected)

Center for Simulation of Accidental Fires and Explosions
  University of Utah, ASCI ASAP Center, C-SAFE
  Uintah Computational Framework (UCF) (C++)
Center for Simulation of Dynamic Response of Materials
  California Institute of Technology, ASCI ASAP Center
  Virtual Testshock Facility (VTF) (Python, Fortran 90)
Earth Systems Modeling Framework (ESMF)
  NSF, NOAA, DOE, NASA, …
  Instrumentation for ESMF framework and applications
  C, C++, and Fortran 95 code modules
  MPI wrapper library for MPI calls


Project Affiliations (selected) (continued)

Lawrence Livermore National Laboratory
  Hydrodynamics (Miranda)
Sandia National Lab and Los Alamos National Lab
  DOE CCTTSS SciDAC project
  Common component architecture (CCA) integration
Argonne National Lab
  Jumpshot SLOG2 SDK project
  ZeptoOS - scalable components for petascale architectures
  KTAU - integration of TAU infrastructure in Linux kernel
Oak Ridge National Lab
  Contribution to the Joule Report: S3D, AORSA3D


Important Questions for Application Developers

How does performance vary with different compilers?
Is poor performance correlated with certain OS features?
Has a recent change caused unanticipated performance?
How does performance vary with MPI variants?
Why is one application version faster than another?
What is the reason for the observed scaling behavior?
Did two runs exhibit similar performance?
How are performance data related to application events?
Which machines will run my code the fastest and why?
Which benchmarks predict my code performance best?

Performance Problem Solving Goals

Answer questions at multiple levels of interest
  Use data from low-level measurements and simulations to predict application performance
  Use high-level performance data spanning dimensions (machines, applications, code revisions, data sets) to examine broad performance trends

Discover general correlations between application performance and features of the external environment

Develop methods to predict application performance from lower-level metrics

Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

PerfDMF: Performance Data Mgmt. Framework

TAU Performance Regression (PerfRegress)

Prototype developed by Alan Morris for Uintah
Re-implement using PerfDMF (work in progress)

Empirical-Based Performance Optimization Processes

[Figure: Integrated Performance Evaluation Environment. Labels recoverable from the diagram: the optimization cycle (Performance Observation, Performance Experimentation, Performance Diagnosis, Performance Tuning; characterization, properties, hypotheses; observability requirements); Experiment Schemas, PerfESL Scripts, Experiment Trials; the TAU Performance System architecture (Instrumentation, Measurement, Analysis); Raw Performance Data, PerfDML Data Description, PerfDML Translators, ORDB, Performance Data and Meta-Data, Performance Query & Analysis Toolkit, Performance Analysis Programs.]

ParaProf Performance Profile Analysis

[Figure: ParaProf inputs. Raw files (HPMToolkit, MpiP, TAU) and PerfDMF-managed (database) data, with metadata organized as Application, Experiment, Trial.]

PerfExplorer

Performance knowledge discovery framework
  Use the existing TAU infrastructure: TAU instrumentation data, PerfDMF
  Client-server based system architecture
  Data mining analysis applied to parallel performance data: comparative, clustering, correlation, dimension reduction, ...

Technology integration
  Relational Database Management Systems (RDBMS)
  Java API and toolkit
  R-project / Omegahat statistical analysis
  WEKA data mining package
  Web-based client
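As a rough illustration of what clustering parallel performance data can look like, the following minimal Python sketch runs a plain k-means (Lloyd's algorithm) over per-process profile vectors. It is not PerfExplorer code, which delegates analysis to packages such as WEKA and R; the function name, the deterministic seeding from the first k points, and the profile values are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical per-process profiles: (seconds in computation, seconds in MPI).
profiles = [
    (9.0, 1.0), (1.2, 8.8), (8.8, 1.1),
    (9.1, 0.9), (1.0, 9.0), (1.1, 9.2),
]
labels, centroids = kmeans(profiles, k=2)
```

With these made-up values the processes split into a compute-dominated group and a communication-dominated group, which is the kind of behavioral grouping a clustering view is meant to expose.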

PerfExplorer Architecture

PerfExplorer Client GUI

Hierarchical and K-means Clustering (sPPM)

Miranda Clustering on 16K Processors

PERC Tool Requirements and Evaluation

Performance Evaluation Research Center (PERC)
  DOE SciDAC
  Evaluation methods/tools for high-end parallel systems

PERC tools study (led by ORNL, Pat Worley)
  In-depth performance analysis of select applications
  Evaluate performance analysis requirements
  Test tool functionality and ease of use

Applications
  Start with fusion code: GYRO
  Repeat with other PERC benchmarks
  Continue with SciDAC codes

Primary Evaluation Machines

Phoenix (ORNL – Cray X1) 512 multi-streaming vector processors

Ram (ORNL – SGI Altix (1.5 GHz Itanium2)) 256 total processors

TeraGrid ~7,738 total processors on 15 machines at 9 sites

Cheetah (ORNL – p690 cluster (1.3 GHz, HPS)) 864 total processors on 27 compute nodes

Seaborg (NERSC – IBM SP3) 6080 total processors on 380 compute nodes

GYRO Execution Parameters

Three benchmark problems
  B1-std: 16n processors, 500 timesteps
  B2-cy: 16n processors, 1000 timesteps
  B3-gtc: 64n processors, 100 timesteps (very large)

Test different methods to evaluate nonlinear terms
  Direct method
  FFT (“nl2” for B1 and B2, “nl1” for B3)

Task affinity enabled/disabled (p690 only)
Memory affinity enabled/disabled (p690 only)
Filesystem location (Cray X1 only)

PerfExplorer Analysis of Self-Instrumented Data

PerfExplorer
  Focus on comparative analysis
  Apply to PERC tool evaluation study

Look at user timer data
  Aggregate data
    No per-process data; process clustering analysis is not applicable
  Timings output every N timesteps
    Some phase analysis possible

Goal
  Recreate manually generated performance reports

PerfExplorer Interface

Select experiments and trials of interest

Data organized in application, experiment, trial structure (will allow arbitrary organization in future)

Experiment metadata
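The application, experiment, trial organization can be pictured as a simple three-level hierarchy. The sketch below is only an illustration in Python; the class names, fields, and sample values are hypothetical and do not reflect the actual PerfDMF schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trial:
    """One measured run, e.g. a single processor count."""
    name: str
    processors: int
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Experiment:
    """A group of related trials, e.g. one benchmark on one machine."""
    name: str
    trials: List[Trial] = field(default_factory=list)


@dataclass
class Application:
    """Top level of the hierarchy."""
    name: str
    experiments: List[Experiment] = field(default_factory=list)

    def select_trials(self, experiment_name: str, min_processors: int = 0) -> List[Trial]:
        """Return trials of the named experiment with at least min_processors."""
        return [
            trial
            for experiment in self.experiments
            if experiment.name == experiment_name
            for trial in experiment.trials
            if trial.processors >= min_processors
        ]


# Hypothetical contents, loosely modeled on the GYRO study in these slides.
gyro = Application("GYRO", [
    Experiment("B1-std / Cheetah", [
        Trial("p16", 16, {"affinity": "enabled"}),
        Trial("p32", 32),
        Trial("p64", 64),
    ]),
    Experiment("B1-std / Phoenix", [Trial("p16", 16)]),
])
selected = gyro.select_trials("B1-std / Cheetah", min_processors=32)
```

Selecting "experiments and trials of interest" then amounts to filtering this tree, as `select_trials` does here for one experiment.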

PerfExplorer Interface

Select analysis

Timesteps per Second

Cray X1 is the fastest to solution in all 3 tests
FFT (nl2) improves time for B3-gtc only
TeraGrid faster than p690 for B1-std?
Plots generated automatically

[Plots: timesteps per second for B1-std, B2-cy, and B3-gtc; the TeraGrid series is labeled.]

Relative Efficiency (B1-std)

By experiment (B1-std)
  Total runtime (Cheetah (red))
By event for one experiment
  Coll_tr (blue) is significant
By experiment for one event
  Shows how Coll_tr behaves for all experiments

[Plot annotations: 16-processor base case; Cheetah; Coll_tr]
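The slides do not spell out how relative efficiency is computed. A common definition, assumed here, compares processor-time products against a base case: E(p) = (p_base × T(p_base)) / (p × T(p)), with the 16-processor run as the base. The short Python sketch below works this through; the function name and the runtimes are hypothetical, not measured GYRO data.

```python
def relative_efficiency(base_procs, base_time, procs, time):
    """Efficiency of a run relative to a base case (1.0 means ideal scaling)."""
    return (base_procs * base_time) / (procs * time)


# Hypothetical total runtimes in seconds, keyed by processor count.
runtimes = {16: 400.0, 32: 220.0, 64: 125.0}
base = 16  # the 16-processor base case

efficiency = {
    procs: relative_efficiency(base, runtimes[base], procs, time)
    for procs, time in runtimes.items()
}
```

The same calculation can be applied to the total runtime of an experiment or to a single event such as Coll_tr, which is how the three views listed above differ.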

Current and Future Work

Vampir/VNG
  Generation of OTF traces natively in TAU

ParaProf
  Developing timestamped profile snapshot performance displays

PerfDMF
  Adding new database backends and distributed support
  Building support for user-created tables

PerfExplorer
  Extending comparative and clustering analysis
  Adding new data mining capabilities
  Building in scripting support

Performance regression testing tool (PerfRegress)

Integrate in Eclipse Parallel Tools Platform (PTP)

Concluding Discussion

Performance tools must be used effectively

More intelligent performance systems for productive use
  Evolve to application-specific performance technology
  Deal with scale by “full range” performance exploration
  Autonomic and integrated tools
  Knowledge-based and knowledge-driven process

Performance observation methods do not necessarily need to change in a fundamental sense
  More automatically controlled and efficiently used

Develop next-generation tools and deliver to community
  Open source with support by ParaTools, Inc.
  http://www.cs.uoregon.edu/research/tau

TUTORIAL: Getting Started!

Step 1: Set up paths

% set path=(/usr/local/tools/tau/<arch>/bin/ $path)
% set path=(/usr/local/tools/vampir $path)
% set path=(/usr/local/intel/compiler91_beta/bin $path)
On mcr

Step 2: Set TAU environment variables

% setenv TAU_MAKEFILE /usr/local/tools/tau/i386_linux/lib/Makefile.tau-mpi-pdt
% setenv TRACEDIR /p/gm1/<login>/<dir>
% setenv TAU_THROTTLE 1
% setenv COUNTER1 GET_TIME_OF_DAY; setenv COUNTER2 PAPI_FP_INS…

Step 3: Build using tau_f90.sh, tau_cc.sh and tau_cxx.sh compilers

Step 4: Visualize the data. ParaProf for profiles, VNG for traces

Step 5: Choose a different measurement option!
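Before moving on to the build step, it can help to confirm that the variables from Step 2 are actually set in the current shell. The following is a small, hypothetical pre-flight check in Python; it is not part of TAU, and the helper names are made up for this sketch. It only inspects the variable names that appear in the tutorial above and makes no claim about which of them a given measurement option requires.

```python
import os

# Variable names taken from Step 2 of the tutorial.
TUTORIAL_VARIABLES = (
    "TAU_MAKEFILE", "TRACEDIR", "TAU_THROTTLE", "COUNTER1", "COUNTER2",
)


def missing_variables(env):
    """Return the tutorial variables that are unset or empty in env."""
    return [name for name in TUTORIAL_VARIABLES if not env.get(name)]


def makefile_exists(env, is_file=os.path.isfile):
    """True if TAU_MAKEFILE is set and points at an existing file."""
    path = env.get("TAU_MAKEFILE", "")
    return bool(path) and is_file(path)


if __name__ == "__main__":
    for name in missing_variables(os.environ):
        print(f"not set: {name}")
    if not makefile_exists(os.environ):
        print("TAU_MAKEFILE is unset or does not point at a file")
```

Run from the shell where the setenv commands were issued, it prints nothing when all five variables are set and the makefile path exists.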

Support Acknowledgements

Department of Energy (DOE)
  Office of Science contracts
  University of Utah ASC Level 1 sub-contract
  LLNL ASC/NNSA Level 3 contract
  LLNL ParaTools/GWT contract

NSF High-End Computing Grant

T.U. Dresden, GWT
  Dr. Wolfgang Nagel and Holger Brunst

Research Centre Juelich
  Dr. Bernd Mohr

Los Alamos National Laboratory contracts