Transcript
Optimization of Instrumentation in Parallel Performance Evaluation Tools

Sameer Shende, Allen D. Malony, Alan Morris
University of Oregon
{sameer, malony, amorris}@cs.uoregon.edu

PARA’06: MS8: Tools for Parallel Performance Analysis, 2:40pm – 3pm, Mon 6/19/06


Outline

Overview of features
  Instrumentation
  Measurement (Profiling, Tracing)
  Analysis tools
Tools and techniques for optimizing instrumentation
Conclusions


TAU Performance System

Tuning and Analysis Utilities (14+ year project effort)
Performance system framework for HPC systems
  Integrated, scalable, portable, flexible, and parallel
Integrated toolkit for performance problem solving
  Automatic instrumentation
  Highly configurable measurement system with support for many flavors of profiling and tracing
  Portable analysis and visualization tools
  Performance data management and data mining
http://www.cs.uoregon.edu/research/tau


TAU Performance System Architecture

[Figure: TAU architecture diagrams, shown on two consecutive slides; the only recoverable label is "event selection"]


Program Database Toolkit (PDT)

[Figure: PDT data flow]
Application / Library
  C / C++ parser -> IL -> C / C++ IL analyzer
  Fortran parser (F77/90/95) -> IL -> Fortran IL analyzer
Program Database Files
DUCTAPE
  PDBhtml: program documentation
  SILOON: application component glue
  CHASM: C++ / F90/95 interoperability
  TAU_instr: automatic source instrumentation


ParaProf – Manager Window

[Figure labels: performance database; derived performance metrics]


ParaProf – Full Profile (Miranda)

8K processors!


ParaProf - Statistics Table (Uintah)


ParaProf – 3D Full Profile (Miranda)

16k processors


ParaProf – 3D Scatterplot (Miranda)

Each point is a “thread” of execution

Relation between four routines shown at once


TAU Instrumentation Approach

Support for standard program events
  Routines
  Classes and templates
  Statement-level blocks
Support for user-defined events
  Begin/End events (“user-defined timers”)
  Atomic events (e.g., size of memory allocated/freed)
Support definition of “semantic” entities for mapping
Support for event groups
Instrumentation optimization (eliminate instrumentation in lightweight routines)


Sampling vs Measured Profiling

Sampling
  At a sample, PC or callstack is examined
  Estimate performance of the program based on samples taken in code regions
  Fixed overhead, depends on inter-sample interval
  Typically used in gprof, prof and other system profilers
Measured Profiling
  Instrumentation calls inserted at code regions
    Entry/exit from routine, outer-loops, “events”
  Accurate measurements, compensation for timer overheads possible
  Accuracy inversely proportional to the granularity of instrumentation
    Coarse grained instrumentation is more accurate
  Overhead of instrumentation depends on event frequency
    Optimize instrumentation to capture necessary detail, eliminate instrumentation in frequently executing lightweight routines
  Used in TAU
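The measured-profiling idea above (entry/exit timers, with optional compensation for timer overhead) can be sketched in a few lines of Python. This is an illustration only, not TAU code: the `Profiler` class, its `instrument` and `compensated` methods, and the per-call overhead value passed by the caller are all assumptions made for the example.

```python
import time
from collections import defaultdict


class Profiler:
    """Toy measured profiler (hypothetical): timers at routine entry and exit."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock                    # injectable so the sketch is testable
        self.calls = defaultdict(int)         # routine name -> number of calls
        self.inclusive = defaultdict(float)   # routine name -> inclusive time

    def instrument(self, func):
        """Wrap func so every call records one entry and one exit event."""
        name = func.__name__

        def wrapper(*args, **kwargs):
            start = self.clock()              # entry event
            try:
                return func(*args, **kwargs)
            finally:
                self.inclusive[name] += self.clock() - start  # exit event
                self.calls[name] += 1

        return wrapper

    def compensated(self, name, overhead_per_call):
        """Subtract an estimated timer overhead for each recorded call."""
        raw = self.inclusive[name]
        return max(0.0, raw - self.calls[name] * overhead_per_call)
```

Because the overhead term grows with the call count, a frequently called lightweight routine is exactly where the correction (and the measurement error) is largest, which is why such routines are candidates for removal from instrumentation.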


TAU Instrumentation

Flexible instrumentation mechanisms at multiple levels
  Source code
    manual (TAU API, TAU Component API)
    automatic
      C, C++, F77/90/95 (Program Database Toolkit (PDT))
      OpenMP (directive rewriting (Opari), POMP spec)
  Object code
    pre-instrumented libraries (e.g., MPI using PMPI)
    statically-linked and dynamically-linked
  Executable code
    dynamic instrumentation (pre-execution) (DynInstAPI)
    virtual machine instrumentation (e.g., Java using JVMPI)
  Runtime Linking (LD_PRELOAD)


PAPI [UTK]

Performance Application Programming Interface
  The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Parallel Tools Consortium project
University of Tennessee, Knoxville
http://icl.cs.utk.edu/papi


KOJAK

KOJAK Toolkit [ICL, UTK and FZJ, Germany]
  Epilog tracing library
  Opari OpenMP re-writing tool
  Expert automatic bottleneck detection trace analyzer
  CUBE performance data browser
http://icl.cs.utk.edu/kojak


Automatic Instrumentation

We now provide compiler wrapper scripts
  Simply replace mpxlf90 with tau_f90.sh
  Automatically instruments Fortran source code, links with TAU MPI Wrapper libraries.
Use tau_cc.sh and tau_cxx.sh for C/C++

Before:
    CXX = mpCC
    F90 = mpxlf90_r
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o
    app: $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
            $(CXX) $(CFLAGS) -c $<

After:
    CXX = tau_cxx.sh
    F90 = tau_f90.sh
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o
    app: $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
            $(CXX) $(CFLAGS) -c $<


Auto Instrumentation using TAU_COMPILER

$(TAU_COMPILER) stub Makefile variable in 2.14+ release
Invokes PDT parser, TAU instrumentor, compiler through tau_compiler.sh shell script
Requires minimal changes to application Makefile
  Compilation rules are not changed
  User sets TAU_MAKEFILE and TAU_OPTIONS environment variables
  User renames the compilers
    F90=xlf90 to F90=tau_f90.sh
Passes options from TAU stub Makefile to the four compilation stages
Uses original compilation command if an error occurs


TAU_COMPILER Options

Optional parameters for $(TAU_COMPILER): [tau_compiler.sh -help]

  -optVerbose            Turn on verbose debugging messages
  -optPdtDir=""          PDT architecture directory. Typically $(PDTDIR)/$(PDTARCHDIR)
  -optPdtF95Opts=""      Options for Fortran parser in PDT (f95parse)
  -optPdtCOpts=""        Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optPdtCxxOpts=""      Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optPdtF90Parser=""    Specify a different Fortran parser, e.g., f90parse instead of f95parse
  -optPdtUser=""         Optional arguments for parsing source code
  -optPDBFile=""         Specify [merged] PDB file. Skips parsing phase.
  -optTauInstr=""        Specify location of tau_instrumentor. Typically $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
  -optTauSelectFile=""   Specify selective instrumentation file for tau_instrumentor
  -optTau=""             Specify options for tau_instrumentor
  -optCompile=""         Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optLinking=""         Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
  -optNoMpi              Removes -l*mpi* libraries during linking (default)
  -optKeepFiles          Does not remove intermediate .pdb and .inst.* files

e.g.,
  % setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose -optPdtCOpts="-I/home -DFOO"'
  % setenv TAU_MAKEFILE /usr/local/tau-2.15.4/ia64/lib/Makefile.tau-icpc-mpi-pdt
  % tau_cxx.sh matrix.cpp -o matrix -lm
  % tau_f90.sh foo.o bar.o -o app -lm


Optimization of Instrumentation Overhead

Group routines into profile groups, runtime selection of profiling groups
Instrument sections of code selectively
  Exclude or include list of routines fed to the instrumentor, controlled manually or automatically
  Rule based control of instrumentation
  Generate selective instrumentation file by examining performance data from a previous run


tau_reduce: Rule-Based Overhead Analysis

Analyze the performance data to determine events with high (relative) overhead performance measurements
Create a select list for excluding those events
Rule grammar (used in tau_reduce tool)
  [GroupName:] Field Operator Number
  GroupName indicates rule applies to events in group
  Field is an event metric attribute (from profile statistics)
    numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
  Operator is one of >, <, or =
  Number is any number
  Compound rules possible using & between simple rules
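To make the rule grammar concrete, here is a minimal Python sketch of a parser and evaluator for rules of the form `[GroupName:] Field Operator Number`, joined by `&`. It is not the tau_reduce implementation: the function names `parse_rule` and `matches`, the dictionary representation of an event, and the simplification that an event belongs to exactly one group are assumptions for illustration.

```python
import re

# Field names as listed in the rule grammar.
FIELDS = {"numcalls", "numsubs", "percent", "usec", "cumusec", "count",
          "totalcount", "stdev", "usecs/call", "counts/call"}
OPS = {
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "=": lambda a, b: a == b,
}
SIMPLE = re.compile(r"^\s*([A-Za-z_/]+)\s*([<>=])\s*(\S+)\s*$")


def parse_rule(text):
    """Parse '[GroupName:] Field Op Number [& Field Op Number ...]'."""
    group = None
    if ":" in text:
        group, text = text.split(":", 1)
        group = group.strip()
    conditions = []
    for part in text.split("&"):
        m = SIMPLE.match(part)
        if not m or m.group(1) not in FIELDS:
            raise ValueError(f"bad rule: {part!r}")
        # float() also accepts scientific notation such as 1e3.
        conditions.append((m.group(1), m.group(2), float(m.group(3))))
    return group, conditions


def matches(rule, event):
    """event: dict of metric values, plus a 'group' key (hypothetical layout)."""
    group, conditions = rule
    if group is not None and event.get("group") != group:
        return False
    return all(OPS[op](event[field], number) for field, op, number in conditions)
```

For example, `matches(parse_rule("TAU_USER:usec < 1000"), event)` is true only for events in group `TAU_USER` whose `usec` value is below 1000.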


Optimizing Instrumentation Overhead: Examples

# Exclude all events that are members of TAU_USER
# and use less than 1000 microseconds
TAU_USER:usec < 1000

# Exclude all events that have less than 1000
# microseconds and are called only once
usec < 1000 & numcalls = 1

# Exclude all events that have less than 1000 usecs per
# call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5

Scientific notation can be used
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25


TAU_REDUCE

Reads profile files and rules
Creates selective instrumentation file
  Specifies which routines should be excluded from instrumentation

[Figure: rules and profile are inputs to tau_reduce, which outputs a selective instrumentation file]
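The data flow on this slide (profile plus rules in, exclude list out) can be sketched as follows. This is a simplified stand-in, not tau_reduce itself: the in-memory profile layout, the use of Python predicates in place of parsed rules, and the function name `write_exclude_list` are assumptions. The `BEGIN_EXCLUDE_LIST` / `END_EXCLUDE_LIST` markers are the ones used in TAU selective instrumentation files.

```python
def write_exclude_list(profile, rules):
    """Return selective-instrumentation text excluding events matched by any rule.

    profile: list of dicts, one per event (hypothetical in-memory form of a profile).
    rules:   list of predicates; separate rules combine with OR.
    """
    excluded = [e["name"] for e in profile if any(rule(e) for rule in rules)]
    lines = ["BEGIN_EXCLUDE_LIST", *excluded, "END_EXCLUDE_LIST"]
    return "\n".join(lines) + "\n"


# Made-up profile data for illustration only.
profile = [
    {"name": "void interchange(int *, int *)", "numcalls": 500000, "usec": 900000.0},
    {"name": "int main(int, char **)", "numcalls": 1, "usec": 5000000.0},
]

# Predicate equivalent of the rule "numcalls > 400000 & usecs/call < 30".
rules = [lambda e: e["numcalls"] > 400000 and e["usec"] / e["numcalls"] < 30]

text = write_exclude_list(profile, rules)
```

Here only `interchange` is excluded: it is called 500000 times at 1.8 microseconds per call, while `main` fails the call-count condition.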


Instrumentation Specification

% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file> ]

For selective instrumentation, use -f option
% tau_instrumentor foo.pdb foo.cpp -o foo.inst.cpp -f selective.dat
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list of routines/files.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
BEGIN_FILE_INCLUDE_LIST
Main.cpp
Foo?.c
*.C
END_FILE_INCLUDE_LIST
# Instruments routines in Main.cpp, Foo?.c and *.C files only
# Use BEGIN_[FILE]_INCLUDE_LIST with END_[FILE]_INCLUDE_LIST
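How such a file might be interpreted can be sketched in Python. This is not the tau_instrumentor logic: the functions `parse_selective` and `should_instrument`, the exact-match treatment of routine signatures, and the precedence between lists are assumptions made for the example. Only the `BEGIN_..._LIST` / `END_..._LIST` markers and the `?` and `*` file wildcards come from the slide.

```python
from fnmatch import fnmatchcase


def parse_selective(text):
    """Collect the entries of each BEGIN_<NAME>_LIST ... END_<NAME>_LIST section."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                                   # blank line or comment
        if line.startswith("BEGIN_") and line.endswith("_LIST"):
            current = line[len("BEGIN_"):-len("_LIST")]
            sections.setdefault(current, [])
        elif line.startswith("END_") and line.endswith("_LIST"):
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections


def should_instrument(sections, routine, filename):
    """Apply the lists; file patterns may use ? and * wildcards."""
    file_inc = sections.get("FILE_INCLUDE")
    if file_inc is not None and not any(fnmatchcase(filename, p) for p in file_inc):
        return False
    if any(fnmatchcase(filename, p) for p in sections.get("FILE_EXCLUDE", [])):
        return False
    if routine in sections.get("EXCLUDE", []):         # exact signature match
        return False
    inc = sections.get("INCLUDE")
    return inc is None or routine in inc
```

Routine signatures are compared literally because they contain `*` as part of C pointer types; only file names are treated as wildcard patterns in this sketch.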


Optimization of Instrumentation Overhead (contd.)

Runtime throttling of events based on rule
  Numcalls > ThresholdA and TimePerCall < ThresholdB
  setenv TAU_THROTTLE 1
  setenv TAU_THROTTLE_NUMCALLS <no>
  setenv TAU_THROTTLE_PERCALL <value>
  Default values:
    <no> = 100000 calls
    <value> = 10 microseconds per call
Once an event meets these conditions, it is disabled at runtime and put in a TAU_DISABLE group
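The throttling rule can be sketched as a small Python class. This is an illustration of the stated condition (call count above one threshold, time per call below another), not TAU's runtime code: the `Throttle` class, its `record` method, and the use of a set to stand in for the TAU_DISABLE group are assumptions. The default thresholds are the ones given on the slide.

```python
class Throttle:
    """Toy runtime throttle (hypothetical): disable frequent, lightweight events."""

    def __init__(self, numcalls=100000, percall_usec=10.0):
        self.numcalls = numcalls          # plays the role of TAU_THROTTLE_NUMCALLS
        self.percall_usec = percall_usec  # plays the role of TAU_THROTTLE_PERCALL
        self.stats = {}                   # event name -> [calls, inclusive usec]
        self.disabled = set()             # stands in for the TAU_DISABLE group

    def record(self, name, elapsed_usec):
        """Record one completed call; return False if the event is throttled."""
        if name in self.disabled:
            return False                  # no further measurement for this event
        entry = self.stats.setdefault(name, [0, 0.0])
        entry[0] += 1
        entry[1] += elapsed_usec
        calls, total = entry
        if calls > self.numcalls and total / calls < self.percall_usec:
            self.disabled.add(name)       # later calls are no longer measured
        return True
```

The call that crosses the threshold is still recorded; only subsequent calls are skipped, so the profile keeps the data gathered up to that point.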


EPILOG Tracing Optimization

TAU and Epilog Tracing Package
  TAU can generate Epilog trace files
    configure -epilog=<dir> -TRACE …
  Epilog uses its own MPI wrapper library
  Events are analyzed by Expert to detect performance bottlenecks automatically
  Output is a CUBE profile file with callpath information
  CUBE output read by CUBE GUI and TAU’s ParaProf profile browser
Expert discards all events that do not call an MPI routine directly/indirectly
  Optimization opportunity for instrumentation


Runtime Instrumentation Control

When TAU is configured with -MPITRACE configuration option (without EPILOG support)
  TAU stores events and wallclock time in a buffer
  Defers writing buffer to disk until an MPI call takes place
  Events directly in callstack are enabled and written to disk
  Other events are discarded
TAU traces are converted to Epilog traces (tau2elg)
Expert has minimal set of events
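The deferred-write behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not TAU's tracing code: the `DeferredTrace` class, its method names, the tuple record layout, and the `close` step that flushes remaining records at the end are all hypothetical.

```python
class DeferredTrace:
    """Toy model: buffer events, keep only routines on a callstack that reaches MPI."""

    def __init__(self):
        self.stack = []       # routines currently active
        self.buffer = []      # (kind, name, time) records not yet written
        self.written = []     # stands in for the trace file on disk
        self.enabled = set()  # routines known to lead to an MPI call

    def enter(self, name, t):
        self.stack.append(name)
        self.buffer.append(("enter", name, t))

    def leave(self, name, t):
        self.stack.pop()
        self.buffer.append(("leave", name, t))

    def _flush(self):
        # Write buffered records of enabled routines, discard the rest.
        self.written += [r for r in self.buffer if r[1] in self.enabled]
        self.buffer.clear()

    def mpi_call(self, name, t):
        self.enabled.update(self.stack)   # everything on the callstack leads to MPI
        self._flush()
        self.written.append(("mpi", name, t))

    def close(self):
        self._flush()                     # assumed end-of-run flush
```

In this model a routine that never has an MPI call beneath it leaves no records in the output, which is what keeps the resulting trace small.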


Callpath Profiling Based Selective Instrumentation

TAU is configured with -PROFILECALLPATH
Env. variable TAU_CALLPATH_DEPTH set to a large value
  Callpaths rooted at “main”
TAU profiles analyzed to produce an “include list” of routines that should be instrumented (tauinc.sh) [F. Wolf]
  Events that call an MPI routine directly/indirectly
TAU generates EPILOG traces
Expert analyzes EPILOG traces to produce CUBE profiles
ParaProf and CUBE browsers read CUBE files
PerfDMF performance database stores bottleneck results
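The include-list derivation can be sketched in Python. This is not the tauinc.sh script: the function name `include_list`, the `" => "` callpath separator, the `MPI_` prefix test, and the sample callpaths are assumptions chosen for illustration.

```python
def include_list(callpaths, sep=" => ", mpi_prefix="MPI_"):
    """Return routines that appear above an MPI routine on any callpath."""
    keep = set()
    for path in callpaths:
        routines = [r.strip() for r in path.split(sep)]
        for i, routine in enumerate(routines):
            if routine.startswith(mpi_prefix):
                # All callers, direct or indirect, except MPI routines themselves.
                keep.update(r for r in routines[:i] if not r.startswith(mpi_prefix))
    return sorted(keep)


# Made-up callpaths rooted at main, for illustration only.
paths = [
    "main => solve => exchange => MPI_Send()",
    "main => solve => kernel",
    "main => io => MPI_Barrier()",
]
routines_to_instrument = include_list(paths)
```

With these sample paths, `kernel` is left out because no callpath through it reaches an MPI routine, while `main`, `solve`, `exchange` and `io` are kept.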


Conclusions

Optimization of instrumentation is critical for balancing the volume of performance data generated

Several techniques for reducing the amount of instrumentation


Support Acknowledgements

Department of Energy (DOE)
  Office of Science contracts
  University of Utah ASC Level 1 sub-contract
  LLNL ASC/NNSA Level 3 contract
  LLNL ParaTools/GWT contract
NSF High-End Computing Grant
T.U. Dresden, GWT
  Dr. Wolfgang Nagel and Holger Brunst
Research Centre Juelich
  Dr. Bernd Mohr, Dr. Felix Wolf
Los Alamos National Laboratory contracts

