Allen D. Malony, Aroon Nataraj {malony,anataraj}@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon
TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair
Performance Research Lab
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer
- Wyatt Spear, Software Engineer
- Scott Biersdorff, Software Engineer
- Li Li, Ph.D. student
  - "Model-based Automatic Performance Diagnosis," Ph.D. thesis, January 2007
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
  - Integrated kernel / application performance analysis
  - Scalable performance monitoring
TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
  - Multiple parallel programming paradigms
  - Parallel performance mapping methodology
- Portable (open source) parallel performance system
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
  - Scalable (very large) parallel performance analysis
- Partners
  - Research Center Jülich, LLNL, ANL, LANL, UTK
TAU Performance Observation Methodology
- Advocate event-based, direct performance observation
- Observe execution events
  - Types: control flow, state-based, user-defined
  - Modes: atomic, interval (enter/exit)
- Instrument program code directly (defines events)
  - Modify program code at points of event occurrence
  - Different code forms (source, library, object, binary, VM)
- Measurement code inserted (instantiates events)
  - Makes events visible
  - Measures performance related to event occurrence
- Contrast with event-based sampling
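To make the interval mode concrete, here is a minimal sketch using TAU's documented C++ profiling macros (the routine names are illustrative):

  #include <TAU.h>

  // interval event: "enter" is recorded when the macro's timer object is
  // created, "exit" when it leaves scope
  void compute()
  {
    TAU_PROFILE("compute", "void ()", TAU_USER);
    // ... work to be measured ...
  }

  int main(int argc, char** argv)
  {
    TAU_PROFILE("main", "int (int, char**)", TAU_DEFAULT);
    compute();
    return 0;
  }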
TAU Performance System Architecture
[Architecture diagram: instrumentation can be introduced at many levels of the program transformation path, from user-level abstractions of the problem domain and source code through preprocessor, compiler, linker, object code and libraries, executable, OS, and VM, down to the runtime image, with performance data produced at run time.]
Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within levels
  - Between levels
- Mapping
  - Performance data is associated with high-level semantic abstractions
TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks and loops
- Support for user-defined events (see the sketch below)
  - Begin/end events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation selection and optimization
- Instrumentation enabling/disabling and runtime throttling
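For atomic events, a minimal sketch using TAU's user-defined event macros (the wrapper function and event name are illustrative; TAU derives count, min, max, mean, and standard deviation from the triggered values):

  #include <TAU.h>
  #include <cstdlib>

  void* tracked_malloc(size_t n)
  {
    // atomic event: one value recorded per occurrence (no enter/exit pair)
    TAU_REGISTER_EVENT(alloc_event, "Memory allocated (bytes)");
    TAU_EVENT(alloc_event, (double) n);
    return malloc(n);
  }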
TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically linked and dynamically linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DyninstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process
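TAU_COMPILER is used from a makefile; a minimal sketch under assumed paths (the stub makefile name and install prefix are assumptions that vary per configuration):

  # include a TAU stub makefile selected at configure time (path illustrative)
  include /usr/local/tau-2.x/x86_64/lib/Makefile.tau-mpi-pdt

  # prefixing the native compiler with $(TAU_COMPILER) inserts the
  # PDT-based instrumentation step and links in the TAU library
  CXX = $(TAU_COMPILER) mpicxx

  app: app.cpp
  	$(CXX) -o app app.cpp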
TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
  - Multiple counters (hardware, user-defined, system)
- Measurement specification separate from instrumentation
TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, and phase profiling
  - Support for memory profiling (headroom, leaks)
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
- Compile-time and runtime measurement selection
Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)
TAU and DyninstAPI
- TAU has had a long-term affair with Dyninst technology
- Dyninst offered a binary-level instrumentation tool
  - Could help in cases when the source code is unavailable
  - Could allow instrumentation without recompilation
- TAU requirements
  - Instrument HPC applications with TAU measurements
  - Multiple paradigms, languages, compilers, platforms
  - Portability
- Tested Dyninst features as they were released
- Issues: MPI, threading, availability, binary rewriting
- It has been an on/off, open relationship
Using DyninstAPI
- TAU uses DyninstAPI for binary code patching
  - Pre-execution versus at any point during execution
- Methods
  - Runtime, before the application begins
  - Binary rewriting
- tau_run (mutator)
  - Loads the TAU measurement library
  - Uses DyninstAPI to instrument the mutatee
  - Can apply instrumentation selection
Using DyninstAPI with TAU
- Configure TAU with Dyninst and build <taudir>/<arch>/bin/tau_run
  % configure -dyninst=/usr/local/dyninstAPI-5.0.1
  % make clean; make install
- tau_run command
  % tau_run [<-o outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>
- Instrument all events with the TAU measurement library and execute:
  % tau_run klargest
- Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute:
  % tau_run -XrunTAUsh-papi a.out
- Instrument only the events specified in the select.tau instrumentation specification file (see the sketch below) and execute:
  % tau_run -f select.tau a.out
- Binary rewriting:
  % tau_run -o a.inst.out a.out
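A select.tau file follows TAU's selective-instrumentation format; a small sketch (the routine signatures are illustrative, and '#' acts as a wildcard within routine names):

  BEGIN_EXCLUDE_LIST
  void quicksort(int *, int, int)
  void interchange(int *, int, int)
  END_EXCLUDE_LIST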
Runtime Instrumentation with DyninstAPI
- tau_run loads TAU's shared object into the mutatee's address space
- Selects the routines to be instrumented
- Calls DyninstAPI OneTimeCode
  - Registers a startup routine
  - Passes a string of routine (event) names ("main | foo | bar")
  - IDs are assigned to events
- TAU's entry/exit hooks are used for instrumentation
- Invoked during execution
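In outline, the mutator side looks roughly like the following (a sketch against DyninstAPI's BPatch interface; the TAU library path and the hook name TauRoutineEntry are assumptions, not TAU's actual internal symbols):

  #include "BPatch.h"
  #include "BPatch_process.h"
  #include "BPatch_function.h"
  #include "BPatch_point.h"
  #include <vector>

  BPatch bpatch;

  void instrument(const char* path, const char** argv)
  {
    // create the stopped mutatee and load the TAU measurement library
    BPatch_process* proc = bpatch.processCreate(path, argv);
    proc->loadLibrary("libTAU.so");                   // path illustrative
    BPatch_image* image = proc->getImage();

    // locate the (hypothetical) TAU entry hook exported by the library
    std::vector<BPatch_function*> hooks, targets;
    image->findFunction("TauRoutineEntry", hooks);    // name is an assumption
    image->findFunction("main", targets);

    // insert a call to the hook at routine entry
    std::vector<BPatch_snippet*> args;                // event id would go here
    BPatch_funcCallExpr call(*hooks[0], args);
    proc->insertSnippet(call, *targets[0]->findPoint(BPatch_entry));

    // a OneTimeCode snippet can additionally run a startup routine once in
    // the mutatee, e.g. to pass the "main | foo | bar" event-name string
    proc->continueExecution();
  }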
Using DyninstAPI with MPI
- One mutator per mutatee
- Each mutator instruments its mutatee prior to execution
- No central control
- Each mutatee writes its own performance data to disk
% mpirun -np 4 ./run.sh
% cat run.sh
#!/bin/sh
/usr/local/tau-2.x/x86_64/bin/tau_run <path>/a.out
Binary Rewriting with TAU
- Rewrite the binary ("save the world") before executing
- No central control
- No need to re-instrument the code on all backend nodes
- Each mutatee writes its own performance data to disk
% tau_run -o a.inst.out a.out
% cd _dyninstsaved0
% mpirun -np 4 ./a.inst.out
Example
- EK-SIMPLE: CFD benchmark (Andy Shaw, Kofi Fynn; adapted by Brad Chamberlain)
- Experimentation
  - Run on 4 CPUs
  - Runtime instrumentation using DyninstAPI and tau_run
  - Wallclock time and CPU time measurement experiments
  - Profiling and tracing modes of measurement
- Look at performance data with ParaProf and Vampir
ParaProf - Main Window (4 cpus)
ParaProf - Individual Profile (n,c,t 0,0,0)
ParaProf - Statistics Table (Mean)
ParaProf - net_recv (MPI rank 1)
Integrated Instrumentation (Source + Dyninst)
- Use source instrumentation for some events
- Use Dyninst for other events
- Access the same TAU measurement infrastructure
- Demonstrate on a matrix multiplication example
  - Compare regular versus strip-mining versions
  - Source instrumented
  - Source + binary instrumented
TAU-over-MRNet (ToM) Project
MRNet as a Transport Substrate in TAU
(Reporting early work done in the last week.)
TAU Transport Substrate: Motivations
- The transport substrate enables movement of measurement-related data
  - TAU, in the past, has relied on a shared file system
- Some modes of performance observation
  - Offline / post-mortem observation and analysis: the least requirements for a specialized transport
  - Online observation: for long-running applications, especially at scale, dumping to the file system can be suboptimal
  - Online observation with feedback into the application: in addition, requires that the transport be bi-directional
- Performance observation problems and requirements are a function of the mode
Requirements
- Improve performance of the transport
  - NFS can be slow and variable
  - Specialization and remoting of FS operations to the front-end
- Data reduction
  - At scale, the cost of moving all data is too high
  - Sample in different domains (node-wise, event-wise)
- Control
  - Selection of events, measurement technique, target nodes
  - What data to output, how often, and in what form?
  - Feedback into the measurement system, feedback into the application
- Online, distributed processing of generated performance data
  - Use the compute resources of the transport nodes
  - Global performance analyses within the topology
  - Distribute statistical analyses: easy (mean, variance, histogram), challenging (clustering)
Approach and First Prototype
- Measurement and measured-data transport are separate concerns
  - No such distinction in TAU previously
  - Created an abstraction to separate and hide the transport: TauOutput
- Did not create a custom transport for TAU
  - Use existing monitoring/transport capabilities
  - Supermon (Sottile and Minnich, LANL)
  - Piggy-backed TAU performance data on Supermon channels
  - Correlate system-level metrics from Supermon with TAU application performance data
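The TauOutput abstraction can be pictured as a small FS-style interface with one implementation per transport (an illustrative sketch, not TAU's actual class definition):

  #include <cstddef>

  // TAU writes profile data through FS-style calls and never sees the
  // transport behind them
  class TauOutput {
  public:
    virtual ~TauOutput() {}
    virtual bool open(const char* name) = 0;
    virtual std::size_t write(const void* buf, std::size_t n) = 0;
    virtual void close() = 0;
  };

  class NfsOutput      : public TauOutput { /* fopen/fwrite/fclose        */ };
  class SupermonOutput : public TauOutput { /* s-expressions over Supermon */ };
  class MrnetOutput    : public TauOutput { /* Stream::send, data channel  */ };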
Rationale
- Moved away from NFS
- Separation of concerns
  - Scalability, portability, robustness addressed independently of TAU
  - Re-use existing technologies where appropriate
- Multiple bindings
  - Use the different solutions best suited to a particular platform
- Implementation speed
  - Easy and fast to create an adapter that binds to an existing transport
  - MRNet support was added in about a week, which says a lot about the usability of MRNet
ToM Architecture
- TAU components over the MRNet API
  - Front-End (FE)
  - Filters
  - Back-End (BE)
- No-Backend-Instantiation mode
- Push-pull model of data retrieval
- No daemon: the instrumented application contains TAU and the Back-End
- Two channels (streams)
  - Data (BE to FE)
  - Control (FE to BE)
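The two channels map naturally onto MRNet streams. A front-end sketch (method names follow the MRNet C++ API of later releases and the 2007-era interface differed in detail; the topology file and back-end executable names are assumptions):

  #include "mrnet/MRNet.h"
  using namespace MRN;

  int main()
  {
    // instantiate the tree from a topology file; "tom_be" is illustrative
    Network* net = Network::CreateNetworkFE("topology.top", "tom_be", NULL);
    Communicator* comm = net->get_BroadcastCommunicator();

    Stream* data = net->new_Stream(comm, TFILTER_NULL);  // BE -> FE data
    Stream* ctrl = net->new_Stream(comm, TFILTER_NULL);  // FE -> BE control

    ctrl->send(FirstApplicationTag, "%d", 1);  // e.g. a "start dumping" command
    ctrl->flush();

    int tag; PacketPtr pkt; Stream* s;
    net->recv(&tag, pkt, &s);      // pull profile packets as back-ends push
    return 0;
  }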
ToM Architecture
- Application calls into TAU
  - Per-iteration explicit call to the output routine
  - Periodic calls using an alarm
- TauOutput object invoked
  - Configuration specific: compile time or runtime
  - One per thread
- TauOutput mimics a subset of FS-style operations
  - Avoids changes to TAU code
  - If required, the rest of TAU can be made aware of the output type
- Non-blocking recv for control
- Back-end pushes, sink pulls
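Per-iteration output then amounts to calling TAU's profile-dump routine from the application's main loop; with ToM configured, the TauOutput object routes the dump to MRNet instead of the file system (the loop body is illustrative):

  #include <TAU.h>

  // inside the application's main iteration loop
  for (int it = 0; it < niters; ++it) {
    do_iteration(it);               // application work (illustrative)
    if (it % 5 == 0)
      TAU_DB_DUMP();                // snapshot all profiles, here every
  }                                 // 5 iterations as in the runs below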
Simple Example (NPB LU - A, Per-5 iterations)
Exclusive time
Simple Example (NPB LU - A, Per-5 iterations)
Number of calls
Comparing ToM with NFS
- TAUoverNFS versus TAUoverMRNet
  - 250 ssor iterations, 251 TAU_DB_DUMP operations
- Significant advantages with a specialized transport substrate
  - Similar results when using Supermon as the substrate
  - Remoting of expensive FS metadata operations to the Front-End
NPB LU (A), 32 processors:

                  Over NFS (secs)   Over MRNet (secs)   % Improvement over NFS
  Total Runtime        45.51              36.50                 19.79
  TAU_DB_DUMP           8.58               0.366                95.73
Playing with Filters
- Downstream (FE to BE) multicast path
  - Even without filters, it is very useful for control
- Data reduction filters are integral to the upstream path (BE to FE)
  - Without filters, lossless data is reproduced D-1 times
  - Unnecessarily large cost to the network
- Filter 1: random sampling filter (see the sketch below)
  - Very simplistic data reduction by node-wise sampling
  - Accepts or rejects packets probabilistically
  - The TAU Front-End can control the probability P(accept)
  - P(accept) = K/N (N = number of leaves, K a user constant)
  - Bounds the expected number of packets per round to K
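The accept/reject core is tiny (a sketch; the exact MRNet filter entry-point signature varies across releases, so only the sampling logic is shown):

  #include <cstdlib>   // drand48 (POSIX)
  #include <cstddef>
  #include <vector>
  #include "mrnet/Packet.h"
  using namespace MRN;

  // forward each incoming packet with probability p_accept = K / N, so on
  // average at most K of the N leaf packets survive each round
  void random_sample(std::vector<PacketPtr>& in,
                     std::vector<PacketPtr>& out, double p_accept)
  {
    for (std::size_t i = 0; i < in.size(); ++i)
      if (drand48() < p_accept)
        out.push_back(in[i]);   // accept: forward upstream
      // else reject: drop here, reducing load toward the Front-End
  }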
Filter 1 in Action (Ring application)
- Compare different P(accept) values: 1, 1/4, 1/16
- Without sampling, the Front-End is unable to keep up
  - Queuing delay propagates back
Other Filters
- Statistics filter (see the merge sketch after this list)
  - Reduce raw performance data to a smaller set of statistics
  - Distribute these statistical analyses from the Front-End to the filters
  - Simple measures: mean, std. dev., histograms
  - More sophisticated measures: distributed clustering
- Controlling filters
  - No direct way to control upstream filters (they are not on the control path)
  - Recommended solution: place upstream filters that work in concert with downstream filters to share control information; requires synchronization of state between upstream and downstream filters
  - Our Echo hack: Back-Ends transparently echo Filter-Control packets back upstream, where they are interpreted by the filters; easier to implement, but control response time may be greater
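For mean and variance, a statistics filter only has to merge partial summaries coming up from its children; the standard (count, mean, M2) combination does this exactly (a self-contained sketch, independent of MRNet specifics):

  // m2 is the sum of squared deviations from the mean; variance = m2 / n
  struct Summary { double n, mean, m2; };

  // merge two partial summaries (parallel variance update); a filter folds
  // all child summaries this way and forwards a single packet upstream
  Summary merge(const Summary& a, const Summary& b)
  {
    Summary r;
    double delta = b.mean - a.mean;
    r.n    = a.n + b.n;
    r.mean = a.mean + delta * (b.n / r.n);
    r.m2   = a.m2 + b.m2 + delta * delta * (a.n * b.n / r.n);
    return r;
  }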
Feedback / Suggestions
- Easy to integrate with MRNet
  - Good examples, documentation, readable source code
- Setup phase
  - Make MRNet intermediate nodes listen on a pre-specified port
  - Allow arbitrary MRNet ranks to connect and then set the IDs in the topology
  - Relaxing strict a priori ranks can make setup easier
  - Setup in job-queue environments is difficult
- Packetization API could be more flexible
  - The current API is usable and simple (var-arg printf style)
  - Composing a packet over a series of staggered stages is difficult
- Allow control over how buffering is performed
  - Important in a push-pull model, as data injection points (rates) are independent of data retrieval
  - Not a problem in a purely pull model
TAUoverMRNet Contrasted with TAUoverSupermon
- Supermon (cluster monitor) vs. MRNet (reduction network)
  - Both are lightweight transport substrates
- Data format
  - Supermon: ASCII s-expressions
  - MRNet: packets with packed binary data
- Supermon setup
  - Loose topology
  - No support/help in setting up intermediate nodes
  - Assumes Supermon is part of the environment
- MRNet setup
  - Strict topology
  - Better support for starting intermediate nodes
  - With/without Back-End instantiation (TAU uses the latter)
- Multiple Front-Ends (or sinks) are possible with Supermon; with MRNet, the front-end needs to program this functionality
- No existing pluggable "filter" support in Supermon; performing aggregation is more difficult with Supermon
- Supermon allows buffer-policy specification; MRNet does not
Future Work
- Dyninst
  - Tighter integration of source and binary instrumentation
  - Conveying source information to the binary level
  - Enabling use of TAU's advanced measurement features
  - Leveraging TAU's performance mapping support
  - Want a robust and portable binary rewriting tool
- MRNet
  - Development of more performance filters
  - Evaluation of MRNet performance for different scenarios
  - Testing at large scale
  - Use in applications