
Page 1

Allen D. Malony, Aroon Nataraj {malony,anataraj}@cs.uoregon.edu

http://www.cs.uoregon.edu/research/tau

Department of Computer and Information Science

Performance Research Laboratory

University of Oregon

TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair

Page 2

TAU Meets Dyninst and MRNet, ParaDyn/Condor Week 2007

Performance Research Lab

Dr. Sameer Shende, Senior Scientist
Alan Morris, Senior Software Engineer
Wyatt Spear, Software Engineer
Scott Biersdorff, Software Engineer
Li Li, Ph.D. student
  "Model-based Automatic Performance Diagnosis," Ph.D. thesis, January 2007
Kevin Huck, Ph.D. student
Aroon Nataraj, Ph.D. student

Integrated kernel / application performance analysis
Scalable performance monitoring

Page 3

TAU Performance System

Tuning and Analysis Utilities (14+ year project effort)
Performance system framework for HPC systems
  Integrated, scalable, flexible, and parallel
  Multiple parallel programming paradigms
  Parallel performance mapping methodology
Portable (open source) parallel performance system
  Instrumentation, measurement, analysis, and visualization
  Portable performance profiling and tracing facility
  Performance data management and data mining
  Scalable (very large) parallel performance analysis
Partners: Research Center Jülich, LLNL, ANL, LANL, UTK

Page 4

TAU Performance Observation Methodology

Advocate event-based, direct performance observation
Observe execution events
  Types: control flow, state-based, user-defined
  Modes: atomic, interval (enter/exit)
Instrument program code directly (defines events)
  Modify program code at points of event occurrence
  Different code forms (source, library, object, binary, VM)
Measurement code inserted (instantiates events)
  Makes events visible
  Measures performance related to event occurrence
Contrast with event-based sampling

Page 5

TAU Performance System Architecture

Page 6

TAU Performance System Architecture

Page 7

[Diagram: multi-level instrumentation. Instrumentation can be inserted at every stage from the user-level abstractions of the problem domain down: source code (via preprocessor and compiler), object code and libraries (via linker), the executable, and the runtime image (OS, VM), with each level producing performance data at run time.]

Multi-Level Instrumentation and Mapping

Multiple interfaces
Information sharing between interfaces
Event selection
  Within levels
  Between levels
Mapping
  Performance data is associated with high-level semantic abstractions

Page 8

TAU Instrumentation Approach

Support for standard program events
  Routines, classes, and templates
  Statement-level blocks and loops
Support for user-defined events
  Begin/end events ("user-defined timers")
  Atomic events (e.g., size of memory allocated/freed)
  Selection of event statistics
Support for definition of "semantic" entities for mapping
Support for event groups (aggregation, selection)
Instrumentation selection and optimization
Instrumentation enabling/disabling and runtime throttling

Page 9

TAU Instrumentation Mechanisms

Source code
  Manual (TAU API, TAU component API)
  Automatic (robust)
    C, C++, F77/90/95 (Program Database Toolkit (PDT))
    OpenMP (directive rewriting (Opari), POMP2 spec)
Object code
  Pre-instrumented libraries (e.g., MPI using PMPI)
  Statically linked and dynamically linked
Executable code
  Dynamic instrumentation (pre-execution) (DyninstAPI)
  Virtual machine instrumentation (e.g., Java using JVMPI)
TAU_COMPILER to automate the instrumentation process

Page 10

TAU Measurement Approach

Portable and scalable parallel profiling solution
  Multiple profiling types and options
  Event selection and control (enabling/disabling, throttling)
  Online profile access and sampling
  Online performance profile overhead compensation
Portable and scalable parallel tracing solution
  Trace translation to EPILOG, VTF3, and OTF
  Trace streams (OTF) and hierarchical trace merging
Robust timing and hardware performance support
  Multiple counters (hardware, user-defined, system)
Measurement specification separate from instrumentation

Page 11

TAU Measurement Mechanisms

Parallel profiling
  Function-level, block-level, statement-level
  Supports user-defined events and mapping events
  TAU parallel profile stored (dumped) during execution
  Support for flat, callgraph/callpath, and phase profiling
  Support for memory profiling (headroom, leaks)
Tracing
  All profile-level events
  Inter-process communication events
  Inclusion of multiple counter data in traced events
Compile-time and runtime measurement selection

Page 12

Performance Analysis and Visualization

Analysis of parallel profile and trace measurements
Parallel profile analysis
  ParaProf: parallel profile analysis and presentation
  ParaVis: parallel performance visualization package
  Profile generation from trace data (tau2pprof)
Performance data management framework (PerfDMF)
Parallel trace analysis
  Translation to VTF (V3.0), EPILOG, OTF formats
  Integration with VNG (Technical University of Dresden)
Online parallel analysis and visualization
Integration with CUBE browser (KOJAK, UTK, FZJ)

Page 13

TAU and DyninstAPI

TAU has had a long-term affair with Dyninst technology
Dyninst offered a binary-level instrumentation tool
  Can help when the source code is unavailable
  Allows instrumentation without recompilation
TAU requirements
  Instrument HPC applications with TAU measurements
  Multiple paradigms, languages, compilers, platforms
  Portability
Tested Dyninst features as they were released
Issues: MPI, threading, availability, binary rewriting
It has been an on/off, open relationship

Page 14

Using DyninstAPI

TAU uses DyninstAPI for binary code patching
  Pre-execution versus at any point during execution
Methods
  Runtime, before the application begins
  Binary rewriting
tau_run (mutator)
  Loads the TAU measurement library
  Uses DyninstAPI to instrument the mutatee
  Can apply instrumentation selection

Page 15

Using DyninstAPI with TAU

Configure TAU with Dyninst and build <taudir>/<arch>/bin/tau_run:

% configure -dyninst=/usr/local/dyninstAPI-5.0.1
% make clean; make install

tau_run command:

% tau_run [-o <outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>

Instrument all events with the TAU measurement library and execute:

% tau_run klargest

Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute:

% tau_run -XrunTAUsh-papi a.out

Instrument only the events specified in the select.tau instrumentation specification file and execute:

% tau_run -f select.tau a.out

Binary rewriting:

% tau_run -o a.inst.out a.out
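For illustration, a selective instrumentation file passed via -f typically lists events to include or exclude by name; the routine signatures below are hypothetical examples, not taken from the deck:

```
# Instrument everything except these routines
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void interchange(int *, int *)
END_EXCLUDE_LIST
```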

Page 16

Runtime Instrumentation with DyninstAPI

tau_run loads TAU's shared object into the mutatee's address space
Selects routines to be instrumented
Calls DyninstAPI OneTimeCode
  Registers a startup routine
  Passes a string of routine (event) names, e.g., "main | foo | bar"
  IDs are assigned to events
TAU's entry/exit hooks are used for instrumentation
Invoked during execution
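The routine-name string handed to the startup routine can be thought of as a pipe-separated list that the measurement library maps to event IDs. A minimal sketch of that mapping (illustrative only, not TAU's actual code):

```python
def assign_event_ids(routine_spec):
    """Split a pipe-separated routine list ("main | foo | bar")
    and assign sequential event IDs in order of appearance."""
    names = [name.strip() for name in routine_spec.split("|") if name.strip()]
    return {name: event_id for event_id, name in enumerate(names)}

ids = assign_event_ids("main | foo | bar")
print(ids)  # {'main': 0, 'foo': 1, 'bar': 2}
```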

Page 17

Using DyninstAPI with MPI

One mutator per mutatee
  Each mutator instruments its mutatee prior to execution
  No central control
  Each mutatee writes its own performance data to disk

% mpirun -np 4 ./run.sh

% cat run.sh

#!/bin/sh

/usr/local/tau-2.x/x86_64/bin/tau_run <path>/a.out

Page 18

Binary Rewriting with TAU

Rewrite the binary ("save the world") before executing
  No central control
  No need to re-instrument the code on all backend nodes
  Each mutatee writes its own performance data to disk

% tau_run -o a.inst.out a.out

% cd _dyninstsaved0

% mpirun -np 4 ./a.inst.out

Page 19

Example

EK-SIMPLE benchmark
  CFD benchmark by Andy Shaw and Kofi Fynn, adapted by Brad Chamberlain
Experimentation
  Run on 4 CPUs
  Runtime instrumentation using DyninstAPI and tau_run
  Wallclock-time and CPU-time measurement experiments
  Profiling and tracing modes of measurement
Examine performance data with ParaProf and Vampir

Page 20

ParaProf - Main Window (4 cpus)

Page 21

ParaProf - Individual Profile (n,c,t 0,0,0)

Page 22

ParaProf - Statistics Table (Mean)

Page 23

ParaProf - net_recv (MPI rank 1)

Page 24

Integrated Instrumentation (Source + Dyninst)

Use source instrumentation for some events
Use Dyninst for other events
Access the same TAU measurement infrastructure
Demonstrated on a matrix multiplication example
  Compare regular versus strip-mining versions
  Source instrumented
  Source + binary instrumented

Page 25

TAU-over-MRNet (ToM) Project

MRNet as a Transport Substrate in TAU

(Reporting early work done in the last week.)

Page 26

TAU Transport Substrate: Motivations

Transport substrate
  Enables movement of measurement-related data
  TAU, in the past, has relied on a shared file system
Some modes of performance observation
  Offline / post-mortem observation and analysis
    Fewest requirements for a specialized transport
  Online observation
    Long-running applications, especially at scale
    Dumping to the file system can be suboptimal
  Online observation with feedback into the application
    Additionally requires that the transport be bi-directional
Performance observation problems and requirements are a function of the mode

Page 27

Requirements

Improve performance of the transport
  NFS can be slow and variable
  Specialization and remoting of FS operations to the front-end
Data reduction
  At scale, the cost of moving data is too high
  Sample in a different domain (node-wise, event-wise)
Control
  Selection of events, measurement technique, target nodes
  What data to output, how often, and in what form?
  Feedback into the measurement system, feedback into the application
Online, distributed processing of generated performance data
  Use the compute resources of transport nodes
  Global performance analyses within the topology
  Distribute statistical analyses: easy (mean, variance, histogram), challenging (clustering)
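Distributing even the "easy" statistics means a transport node must combine partial results from its children rather than see raw samples. One standard way to do this for mean and variance is the pairwise count/mean/M2 merge; a self-contained sketch (illustrative, not ToM filter code):

```python
def summarize(samples):
    """Per-node partial statistics: (count, mean, sum of squared deviations)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples)
    return (n, mean, m2)

def merge(a, b):
    """Combine two partial statistics, as a reduction filter
    on an intermediate transport node might."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

# Two back-ends report partial stats; a filter merges them.
left = summarize([1.0, 2.0, 3.0])
right = summarize([4.0, 5.0])
n, mean, m2 = merge(left, right)
print(n, mean, m2 / n)  # prints: 5 3.0 2.0 (count, mean, population variance)
```

The merge is associative, so it can be applied at every level of the tree without the front-end ever seeing raw data.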

Page 28

Approach and First Prototype

Measurement and measured-data transport are separate concerns
  No such distinction existed in TAU
  Created an abstraction, TauOutput, to separate and hide the transport
Did not create a custom transport for TAU
  Use existing monitoring/transport capabilities
  Supermon (Sottile and Minnich, LANL)
  Piggy-backed TAU performance data on Supermon channels
  Correlate system-level metrics from Supermon with TAU application performance data

Page 29

Rationale

Moved away from NFS
Separation of concerns
  Scalability, portability, robustness addressed independently of TAU
Re-use existing technologies where appropriate
Multiple bindings
  Use different solutions best suited to a particular platform
Implementation speed
  Easy and fast to create an adapter that binds to an existing transport
  MRNet support was added in about a week, which says a lot about the usability of MRNet

Page 30

ToM Architecture

TAU components (over the MRNet API)
  Front-End (FE)
  Filters
  Back-End (BE)
No-Backend-Instantiation mode
Push-pull model of data retrieval
No daemon: the instrumented application contains TAU and the Back-End
Two channels (streams)
  Data (BE to FE)
  Control (FE to BE)

Page 31

ToM Architecture

Application calls into TAU
  Per-iteration explicit call to an output routine
  Periodic calls using an alarm
TauOutput object invoked
  Configuration-specific: compile time or runtime
  One per thread
TauOutput mimics a subset of FS-style operations
  Avoids changes to TAU code
  If required, the rest of TAU can be made aware of the output type
Non-blocking recv for control
Back-end pushes, sink pulls
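The TauOutput idea, mimicking a small set of FS-style operations so the rest of the measurement system need not know whether bytes land on disk or go over a transport substrate, might be sketched like this (hypothetical names, not TAU's actual interface):

```python
class FileSink:
    """Binds the FS-style operations to an in-memory 'file' for this sketch;
    a real binding would write to NFS, Supermon, or an MRNet stream."""
    def __init__(self):
        self.data = []

    def write(self, buf):
        self.data.append(buf)

    def close(self):
        pass

class TauOutput:
    """FS-style facade: callers just write/close; the bound sink decides
    where the measurement data actually goes (one instance per thread)."""
    def __init__(self, sink):
        self.sink = sink

    def write(self, buf):
        self.sink.write(buf)

    def close(self):
        self.sink.close()

sink = FileSink()
out = TauOutput(sink)
out.write("profile-dump-1")
out.close()
print(sink.data)  # ['profile-dump-1']
```

Swapping the sink class changes the transport without touching any caller, which is the point of the abstraction.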

Page 32

Simple Example (NPB LU - A, Per-5 iterations)

Exclusive time

Page 33

Simple Example (NPB LU - A, Per-5 iterations)

Number of calls

Page 34

Comparing ToM with NFS

TAUoverNFS versus TAUoverMRNet
  250 ssor iterations, 251 TAU_DB_DUMP operations
Significant advantages with a specialized transport substrate
  Similar results when using Supermon as the substrate
  Remoting of expensive FS metadata operations to the Front-End

NPB LU (A), 32 processors:

                 Over NFS (secs)   Over MRNet (secs)   % Improvement over NFS
  Total Runtime       45.51             36.50                  19.79
  TAU_DB_DUMP          8.58              0.366                 95.73
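The percent-improvement column follows directly from the two timing columns:

```python
def improvement(nfs_secs, mrnet_secs):
    """Percent improvement of MRNet over NFS for one row of the table."""
    return (nfs_secs - mrnet_secs) / nfs_secs * 100

print(improvement(45.51, 36.50))   # total runtime, table reports 19.79
print(improvement(8.58, 0.366))    # TAU_DB_DUMP, table reports 95.73
```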

Page 35

Playing with Filters

Downstream (FE-to-BE) multicast path
  Even without filters, very useful for control
Data-reduction filters are integral to the upstream (BE-to-FE) path
  Without filters, loss-less data is reproduced D-1 times
  Unnecessarily large cost to the network
Filter 1: Random Sampling Filter
  Very simplistic data reduction by node-wise sampling
  Accepts or rejects packets probabilistically
  TAU Front-End can control the probability P(accept)
  P(accept) = K/N (N = number of leaves, K a user constant)
  Bounds the expected number of packets per round to K
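The sampling rule is simple enough to sketch; the code below (illustrative, not the actual ToM filter) accepts each incoming packet with probability K/N so that, in expectation, about K packets survive each round:

```python
import random

def p_accept(k, n_leaves):
    """P(accept) = K/N, capped at 1; bounds expected packets per round to K."""
    return min(1.0, k / n_leaves)

def sampling_filter(packets, k, n_leaves, rng):
    """Pass each packet with probability P(accept); drop the rest."""
    p = p_accept(k, n_leaves)
    return [pkt for pkt in packets if rng.random() < p]

rng = random.Random(0)                       # seeded for reproducibility
round_packets = [f"node-{i}" for i in range(64)]
kept = sampling_filter(round_packets, k=16, n_leaves=64, rng=rng)
print(p_accept(16, 64), len(kept))  # P(accept) = 0.25; roughly 16 packets pass
```

Because the front-end controls K, it can throttle the upstream data rate without touching the back-ends.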

Page 36

Filter 1 in Action (Ring application)

Compare different P(accept) values: 1, 1/4, 1/16
Front-End unable to keep up
  Queuing delay propagated back

Page 37

Other Filters

Statistics filter
  Reduce raw performance data to a smaller set of statistics
  Distribute these statistical analyses from the Front-End to the filters
  Simple measures: mean, standard deviation, histograms
  More sophisticated measures: distributed clustering
Controlling filters
  No direct way to control upstream filters (not on the control path)
  Recommended solution
    Place upstream filters that work in concert with downstream filters to share control information
    Requires synchronization of state between upstream and downstream filters
  Our Echo hack
    Back-Ends transparently echo Filter-Control packets back upstream
    These are then interpreted by the filters
    Easier to implement; control response time may be greater

Page 38

Feedback / Suggestions

Easy to integrate with MRNet
  Good example documentation, readable source code
Setup phase
  Make MRNet intermediate nodes listen on a pre-specified port
  Allow arbitrary MRNet ranks to connect and then set the IDs in the topology
  Relaxing strict a-priori ranks can make setup easier
  Setup in job-queue environments is difficult
Packetization API could be more flexible
  Current API is usable and simple (var-arg printf style)
  Composing a packet over a series of staggered stages is difficult
Allow control over how buffering is performed
  Important in a push-pull model, as data injection points (rates) are independent of data retrieval
  Not a problem in a purely pull model

Page 39

TAUoverMRNet contrasted with TAUoverSupermon

Supermon (cluster monitor) vs. MRNet (reduction network)
  Both are light-weight transport substrates
Data format
  Supermon: ASCII s-expressions
  MRNet: packets with packed (binary?) data
Supermon setup
  Loose topology
  No support/help in setting up intermediate nodes
  Assumes Supermon is part of the environment
MRNet setup
  Strict topology
  Better support for starting intermediate nodes
  With/without Back-End instantiation (TAU uses the latter)
Multiple Front-Ends (or sinks) possible with Supermon
  With MRNet, the front-end needs to program this functionality
No existing pluggable "filter" support in Supermon
  Performing aggregation is more difficult with Supermon
Supermon allows buffer-policy specification; MRNet does not

Page 40

Future Work

Dyninst
  Tighter integration of source and binary instrumentation
  Conveying source information to the binary level
  Enabling use of TAU's advanced measurement features
  Leveraging TAU's performance mapping support
  Want a robust and portable binary rewriting tool
MRNet
  Development of more performance filters
  Evaluation of MRNet performance in different scenarios
  Testing at large scale
  Use in applications