Allen D. Malony, Aroon Nataraj {malony,anataraj}@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon
TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair
Performance Research Lab
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer
- Wyatt Spear, Software Engineer
- Scott Biersdorff, Software Engineer
- Li Li, Ph.D. student
  - "Model-based Automatic Performance Diagnosis," Ph.D. thesis, January 2007
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
  - Integrated kernel / application performance analysis
  - Scalable performance monitoring
TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
  - Multiple parallel programming paradigms
  - Parallel performance mapping methodology
- Portable (open source) parallel performance system
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
  - Scalable (very large) parallel performance analysis
- Partners
  - Research Center Jülich, LLNL, ANL, LANL, UTK
TAU Performance Observation Methodology
- Advocate event-based, direct performance observation
- Observe execution events
  - Types: control flow, state-based, user-defined
  - Modes: atomic, interval (enter/exit)
- Instrument program code directly (defines events)
  - Modify program code at points of event occurrence
  - Different code forms (source, library, object, binary, VM)
- Measurement code inserted (instantiates events)
  - Makes events visible
  - Measures performance related to event occurrence
- Contrast with event-based sampling
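To make the interval mode concrete, here is a minimal sketch using TAU's documented C++ profiling macros (the routine names are illustrative):

  #include <TAU.h>

  // interval event: "enter" is recorded when the macro's timer object is
  // created, "exit" when it leaves scope
  void compute()
  {
    TAU_PROFILE("compute", "void ()", TAU_USER);
    // ... work to be measured ...
  }

  int main(int argc, char** argv)
  {
    TAU_PROFILE("main", "int (int, char**)", TAU_DEFAULT);
    compute();
    return 0;
  }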
TAU Performance System Architecture
[Architecture diagram: instrumentation can be introduced at many levels of the program transformation path, from user-level abstractions of the problem domain and source code through preprocessor, compiler, linker, object code and libraries, executable, OS, and VM, down to the runtime image, with performance data produced at run time.]
Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within levels
  - Between levels
- Mapping
  - Performance data is associated with high-level semantic abstractions
TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks and loops
- Support for user-defined events (see the sketch below)
  - Begin/end events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation selection and optimization
- Instrumentation enabling/disabling and runtime throttling
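For atomic events, a minimal sketch using TAU's user-defined event macros (the wrapper function and event name are illustrative; TAU derives count, min, max, mean, and standard deviation from the triggered values):

  #include <TAU.h>
  #include <cstdlib>

  void* tracked_malloc(size_t n)
  {
    // atomic event: one value recorded per occurrence (no enter/exit pair)
    TAU_REGISTER_EVENT(alloc_event, "Memory allocated (bytes)");
    TAU_EVENT(alloc_event, (double) n);
    return malloc(n);
  }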
TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically linked and dynamically linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DyninstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process
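TAU_COMPILER is used from a makefile; a minimal sketch under assumed paths (the stub makefile name and install prefix are assumptions that vary per configuration):

  # include a TAU stub makefile selected at configure time (path illustrative)
  include /usr/local/tau-2.x/x86_64/lib/Makefile.tau-mpi-pdt

  # prefixing the native compiler with $(TAU_COMPILER) inserts the
  # PDT-based instrumentation step and links in the TAU library
  CXX = $(TAU_COMPILER) mpicxx

  app: app.cpp
  	$(CXX) -o app app.cpp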
TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
  - Multiple counters (hardware, user-defined, system)
- Measurement specification separate from instrumentation
TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, and phase profiling
  - Support for memory profiling (headroom, leaks)
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
- Compile-time and runtime measurement selection
Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)
TAU and DyninstAPI
- TAU has had a long-term affair with Dyninst technology
- Dyninst offered a binary-level instrumentation tool
  - Could help in cases when the source code is unavailable
  - Could allow instrumentation without recompilation
- TAU requirements
  - Instrument HPC applications with TAU measurements
  - Multiple paradigms, languages, compilers, platforms
  - Portability
- Tested Dyninst features as they were released
- Issues: MPI, threading, availability, binary rewriting
- It has been an on/off, open relationship
Using DyninstAPI
- TAU uses DyninstAPI for binary code patching
  - Pre-execution versus at any point during execution
- Methods
  - Runtime, before the application begins
  - Binary rewriting
- tau_run (mutator)
  - Loads the TAU measurement library
  - Uses DyninstAPI to instrument the mutatee
  - Can apply instrumentation selection
Using DyninstAPI with TAU
- Configure TAU with Dyninst and build <taudir>/<arch>/bin/tau_run
  % configure -dyninst=/usr/local/dyninstAPI-5.0.1
  % make clean; make install
- tau_run command
  % tau_run [<-o outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>
- Instrument all events with the TAU measurement library and execute:
  % tau_run klargest
- Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute:
  % tau_run -XrunTAUsh-papi a.out
- Instrument only the events specified in the select.tau instrumentation specification file (see the sketch below) and execute:
  % tau_run -f select.tau a.out
- Binary rewriting:
  % tau_run -o a.inst.out a.out
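A select.tau file follows TAU's selective-instrumentation format; a small sketch (the routine signatures are illustrative, and '#' acts as a wildcard within routine names):

  BEGIN_EXCLUDE_LIST
  void quicksort(int *, int, int)
  void interchange(int *, int, int)
  END_EXCLUDE_LIST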
Runtime Instrumentation with DyninstAPI
- tau_run loads TAU's shared object into the mutatee's address space
- Selects the routines to be instrumented
- Calls DyninstAPI OneTimeCode
  - Registers a startup routine
  - Passes a string of routine (event) names ("main | foo | bar")
  - IDs are assigned to events
- TAU's entry/exit hooks are used for instrumentation
- Invoked during execution
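In outline, the mutator side looks roughly like the following (a sketch against DyninstAPI's BPatch interface; the TAU library path and the hook name TauRoutineEntry are assumptions, not TAU's actual internal symbols):

  #include "BPatch.h"
  #include "BPatch_process.h"
  #include "BPatch_function.h"
  #include "BPatch_point.h"
  #include <vector>

  BPatch bpatch;

  void instrument(const char* path, const char** argv)
  {
    // create the stopped mutatee and load the TAU measurement library
    BPatch_process* proc = bpatch.processCreate(path, argv);
    proc->loadLibrary("libTAU.so");                   // path illustrative
    BPatch_image* image = proc->getImage();

    // locate the (hypothetical) TAU entry hook exported by the library
    std::vector<BPatch_function*> hooks, targets;
    image->findFunction("TauRoutineEntry", hooks);    // name is an assumption
    image->findFunction("main", targets);

    // insert a call to the hook at routine entry
    std::vector<BPatch_snippet*> args;                // event id would go here
    BPatch_funcCallExpr call(*hooks[0], args);
    proc->insertSnippet(call, *targets[0]->findPoint(BPatch_entry));

    // a OneTimeCode snippet can additionally run a startup routine once in
    // the mutatee, e.g. to pass the "main | foo | bar" event-name string
    proc->continueExecution();
  }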
Using DyninstAPI with MPI
- One mutator per mutatee
- Each mutator instruments its mutatee prior to execution
- No central control
- Each mutatee writes its own performance data to disk
% mpirun -np 4 ./run.sh
% cat run.sh
#!/bin/sh
/usr/local/tau-2.x/x86_64/bin/tau_run <path>/a.out
Binary Rewriting with TAU
- Rewrite the binary ("save the world") before executing
- No central control
- No need to re-instrument the code on all backend nodes
- Each mutatee writes its own performance data to disk
% tau_run -o a.inst.out a.out
% cd _dyninstsaved0
% mpirun -np 4 ./a.inst.out
Example
- EK-SIMPLE: CFD benchmark (Andy Shaw, Kofi Fynn; adapted by Brad Chamberlain)
- Experimentation
  - Run on 4 CPUs
  - Runtime instrumentation using DyninstAPI and tau_run
  - Wallclock time and CPU time measurement experiments
  - Profiling and tracing modes of measurement
- Look at performance data with ParaProf and Vampir
ParaProf - Main Window (4 cpus)
ParaProf - Individual Profile (n,c,t 0,0,0)
ParaProf - Statistics Table (Mean)
ParaProf - net_recv (MPI rank 1)
Integrated Instrumentation (Source + Dyninst)
- Use source instrumentation for some events
- Use Dyninst for other events
- Access the same TAU measurement infrastructure
- Demonstrate on a matrix multiplication example
  - Compare regular versus strip-mining versions
  - Source instrumented
  - Source + binary instrumented
TAU-over-MRNet (ToM) Project
MRNet as a Transport Substrate in TAU
(Reporting early work done in the last week.)
TAU Transport Substrate: Motivations
- The transport substrate enables movement of measurement-related data
  - TAU, in the past, has relied on a shared file system
- Some modes of performance observation
  - Offline / post-mortem observation and analysis: the least requirements for a specialized transport
  - Online observation: for long-running applications, especially at scale, dumping to the file system can be suboptimal
  - Online observation with feedback into the application: in addition, requires that the transport be bi-directional
- Performance observation problems and requirements are a function of the mode
Requirements
- Improve performance of the transport
  - NFS can be slow and variable
  - Specialization and remoting of FS operations to the front-end
- Data reduction
  - At scale, the cost of moving all data is too high
  - Sample in different domains (node-wise, event-wise)
- Control
  - Selection of events, measurement technique, target nodes
  - What data to output, how often, and in what form?
  - Feedback into the measurement system, feedback into the application
- Online, distributed processing of generated performance data
  - Use the compute resources of the transport nodes
  - Global performance analyses within the topology
  - Distribute statistical analyses: easy (mean, variance, histogram), challenging (clustering)
Approach and First Prototype
- Measurement and measured-data transport are separate concerns
  - No such distinction in TAU previously
  - Created an abstraction to separate and hide the transport: TauOutput
- Did not create a custom transport for TAU
  - Use existing monitoring/transport capabilities
  - Supermon (Sottile and Minnich, LANL)
  - Piggy-backed TAU performance data on Supermon channels
  - Correlate system-level metrics from Supermon with TAU application performance data
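The TauOutput abstraction can be pictured as a small FS-style interface with one implementation per transport (an illustrative sketch, not TAU's actual class definition):

  #include <cstddef>

  // TAU writes profile data through FS-style calls and never sees the
  // transport behind them
  class TauOutput {
  public:
    virtual ~TauOutput() {}
    virtual bool open(const char* name) = 0;
    virtual std::size_t write(const void* buf, std::size_t n) = 0;
    virtual void close() = 0;
  };

  class NfsOutput      : public TauOutput { /* fopen/fwrite/fclose        */ };
  class SupermonOutput : public TauOutput { /* s-expressions over Supermon */ };
  class MrnetOutput    : public TauOutput { /* Stream::send, data channel  */ };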
Rationale
- Moved away from NFS
- Separation of concerns
  - Scalability, portability, robustness addressed independently of TAU
  - Re-use existing technologies where appropriate
- Multiple bindings
  - Use the different solutions best suited to a particular platform
- Implementation speed
  - Easy and fast to create an adapter that binds to an existing transport
  - MRNet support was added in about a week, which says a lot about the usability of MRNet
ToM Architecture
- TAU components over the MRNet API
  - Front-End (FE)
  - Filters
  - Back-End (BE)
- No-Backend-Instantiation mode
- Push-pull model of data retrieval
- No daemon: the instrumented application contains TAU and the Back-End
- Two channels (streams)
  - Data (BE to FE)
  - Control (FE to BE)
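The two channels map naturally onto MRNet streams. A front-end sketch (method names follow the MRNet C++ API of later releases and the 2007-era interface differed in detail; the topology file and back-end executable names are assumptions):

  #include "mrnet/MRNet.h"
  using namespace MRN;

  int main()
  {
    // instantiate the tree from a topology file; "tom_be" is illustrative
    Network* net = Network::CreateNetworkFE("topology.top", "tom_be", NULL);
    Communicator* comm = net->get_BroadcastCommunicator();

    Stream* data = net->new_Stream(comm, TFILTER_NULL);  // BE -> FE data
    Stream* ctrl = net->new_Stream(comm, TFILTER_NULL);  // FE -> BE control

    ctrl->send(FirstApplicationTag, "%d", 1);  // e.g. a "start dumping" command
    ctrl->flush();

    int tag; PacketPtr pkt; Stream* s;
    net->recv(&tag, pkt, &s);      // pull profile packets as back-ends push
    return 0;
  }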
ToM Architecture
- Application calls into TAU
  - Per-iteration explicit call to the output routine
  - Periodic calls using an alarm
- TauOutput object invoked
  - Configuration specific: compile time or runtime
  - One per thread
- TauOutput mimics a subset of FS-style operations
  - Avoids changes to TAU code
  - If required, the rest of TAU can be made aware of the output type
- Non-blocking recv for control
- Back-end pushes, sink pulls
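Per-iteration output then amounts to calling TAU's profile-dump routine from the application's main loop; with ToM configured, the TauOutput object routes the dump to MRNet instead of the file system (the loop body is illustrative):

  #include <TAU.h>

  // inside the application's main iteration loop
  for (int it = 0; it < niters; ++it) {
    do_iteration(it);               // application work (illustrative)
    if (it % 5 == 0)
      TAU_DB_DUMP();                // snapshot all profiles, here every
  }                                 // 5 iterations as in the runs below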
Simple Example (NPB LU - A, Per-5 iterations)
Exclusive time
Simple Example (NPB LU - A, Per-5 iterations)
Number of calls
Comparing ToM with NFS
- TAUoverNFS versus TAUoverMRNet
  - 250 ssor iterations, 251 TAU_DB_DUMP operations
- Significant advantages with a specialized transport substrate
  - Similar results when using Supermon as the substrate
  - Remoting of expensive FS metadata operations to the Front-End
NPB LU (A), 32 processors:

                  Over NFS (secs)   Over MRNet (secs)   % Improvement over NFS
  Total Runtime        45.51              36.50                 19.79
  TAU_DB_DUMP           8.58               0.366                95.73
Playing with Filters
- Downstream (FE to BE) multicast path
  - Even without filters, it is very useful for control
- Data reduction filters are integral to the upstream path (BE to FE)
  - Without filters, lossless data is reproduced D-1 times
  - Unnecessarily large cost to the network
- Filter 1: random sampling filter (see the sketch below)
  - Very simplistic data reduction by node-wise sampling
  - Accepts or rejects packets probabilistically
  - The TAU Front-End can control the probability P(accept)
  - P(accept) = K/N (N = number of leaves, K a user constant)
  - Bounds the expected number of packets per round to K
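The accept/reject core is tiny (a sketch; the exact MRNet filter entry-point signature varies across releases, so only the sampling logic is shown):

  #include <cstdlib>   // drand48 (POSIX)
  #include <cstddef>
  #include <vector>
  #include "mrnet/Packet.h"
  using namespace MRN;

  // forward each incoming packet with probability p_accept = K / N, so on
  // average at most K of the N leaf packets survive each round
  void random_sample(std::vector<PacketPtr>& in,
                     std::vector<PacketPtr>& out, double p_accept)
  {
    for (std::size_t i = 0; i < in.size(); ++i)
      if (drand48() < p_accept)
        out.push_back(in[i]);   // accept: forward upstream
      // else reject: drop here, reducing load toward the Front-End
  }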
Filter 1 in Action (Ring application)
- Compare different P(accept) values: 1, 1/4, 1/16
- Without sampling, the Front-End is unable to keep up
  - Queuing delay propagates back
Other Filters
- Statistics filter (see the merge sketch after this list)
  - Reduce raw performance data to a smaller set of statistics
  - Distribute these statistical analyses from the Front-End to the filters
  - Simple measures: mean, std. dev., histograms
  - More sophisticated measures: distributed clustering
- Controlling filters
  - No direct way to control upstream filters (they are not on the control path)
  - Recommended solution: place upstream filters that work in concert with downstream filters to share control information; requires synchronization of state between upstream and downstream filters
  - Our Echo hack: Back-Ends transparently echo Filter-Control packets back upstream, where they are interpreted by the filters; easier to implement, but control response time may be greater
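For mean and variance, a statistics filter only has to merge partial summaries coming up from its children; the standard (count, mean, M2) combination does this exactly (a self-contained sketch, independent of MRNet specifics):

  // m2 is the sum of squared deviations from the mean; variance = m2 / n
  struct Summary { double n, mean, m2; };

  // merge two partial summaries (parallel variance update); a filter folds
  // all child summaries this way and forwards a single packet upstream
  Summary merge(const Summary& a, const Summary& b)
  {
    Summary r;
    double delta = b.mean - a.mean;
    r.n    = a.n + b.n;
    r.mean = a.mean + delta * (b.n / r.n);
    r.m2   = a.m2 + b.m2 + delta * delta * (a.n * b.n / r.n);
    return r;
  }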
Feedback / Suggestions
- Easy to integrate with MRNet
  - Good examples, documentation, readable source code
- Setup phase
  - Make MRNet intermediate nodes listen on a pre-specified port
  - Allow arbitrary MRNet ranks to connect and then set the IDs in the topology
  - Relaxing strict a priori ranks can make setup easier
  - Setup in job-queue environments is difficult
- Packetization API could be more flexible
  - The current API is usable and simple (var-arg printf style)
  - Composing a packet over a series of staggered stages is difficult
- Allow control over how buffering is performed
  - Important in a push-pull model, as data injection points (rates) are independent of data retrieval
  - Not a problem in a purely pull model
TAUoverMRNet Contrasted with TAUoverSupermon
- Supermon (cluster monitor) vs. MRNet (reduction network)
  - Both are lightweight transport substrates
- Data format
  - Supermon: ASCII s-expressions
  - MRNet: packets with packed binary data
- Supermon setup
  - Loose topology
  - No support/help in setting up intermediate nodes
  - Assumes Supermon is part of the environment
- MRNet setup
  - Strict topology
  - Better support for starting intermediate nodes
  - With/without Back-End instantiation (TAU uses the latter)
- Multiple Front-Ends (or sinks) are possible with Supermon; with MRNet, the front-end needs to program this functionality
- No existing pluggable "filter" support in Supermon; performing aggregation is more difficult with Supermon
- Supermon allows buffer-policy specification; MRNet does not
Future Work
- Dyninst
  - Tighter integration of source and binary instrumentation
  - Conveying source information to the binary level
  - Enabling use of TAU's advanced measurement features
  - Leveraging TAU's performance mapping support
  - Want a robust and portable binary rewriting tool
- MRNet
  - Development of more performance filters
  - Evaluation of MRNet performance for different scenarios
  - Testing at large scale
  - Use in applications