
Allen D. Malony Sameer S. Shende Robert Bell {malony,shende,bell}@cs.uoregon.edu

Department of Computer and Information Science

Computational Science Institute

University of Oregon

The TAU Performance System


The TAU Performance System SC2002 PERC Tutorial, Nov. 17, 2002

Overview

Motivation and goals
TAU architecture and toolkit
  Instrumentation
  Measurement
  Analysis
Performance mapping
Application case studies
…
TAU Integration
Work in progress
Conclusions


Motivation

Tools for performance problem solving
  Empirical-based performance optimization process
  Performance technology: instrumentation, measurement, analysis, visualization

[Figure: performance optimization process with stages Performance Tuning, Performance Diagnosis, Performance Experimentation, and Performance Observation; labels: hypotheses, properties, characterization]


Problems

Diverse performance observability requirements
  Multiple levels of software and hardware
  Different types and detail of performance data
  Alternative performance problem solving methods
  Multiple targets of software and system application

Demands more robust performance technology
  Broad scope of performance observation
  Flexible and configurable mechanisms
  Technology integration and extension
  Cross-platform portability
  Open, layered, and modular framework architecture


Complexity Challenges for Performance Tools

Computing system environment complexity
  Observation integration and optimization
  Access, accuracy, and granularity constraints
  Diverse/specialized observation capabilities/technology
  Restricted modes limit performance problem solving

Sophisticated software development environments
  Programming paradigms and performance models
  Performance data mapping to software abstractions
  Uniformity of performance abstraction across platforms
  Rich observation capabilities and flexible configuration
  Common performance problem solving methods


General Problems (Performance Technology)

How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?

How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?


Computation Model for Performance Technology

How to address dual performance technology goals?
  Robust capabilities + widely available methods
  Contend with problems of system diversity
  Flexible tool composition/configuration/integration

Approaches
  Restrict computation types / performance problems
    machines, languages, instrumentation technique, …
    limited performance technology coverage and application
  Base technology on abstract computation model
    general architecture and software execution features
    map features/methods to existing complex system types
    develop capabilities that can be adapted and optimized


General Complex System Computation Model

Node: physically distinct shared memory machine
  Message passing node interconnection network
Context: distinct virtual memory space within node
Thread: execution threads (user/system) in context

[Figure: physical view (SMP nodes with node memory, joined by an interconnection network carrying inter-node message communication) beside the model view (node, context as VM space, threads)]


TAU Performance System

Tuning and Analysis Utilities
Performance system framework for scalable parallel and distributed high-performance computing
Targets a general complex system computation model
  nodes / contexts / threads
  Multi-level: system / software / parallelism
  Measurement and analysis abstraction
Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  Portable performance profiling and tracing facility
  Open software approach with technology integration
University of Oregon, Forschungszentrum Jülich, LANL


TAU Performance System Architecture

[Figure: TAU performance system architecture diagram; surviving labels: EPILOG, Paraver]


TAU Performance System Goals

Multi-level performance instrumentation
  Multi-language automatic source instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling system
  Computer system architectures and operating systems
  Different programming languages and compilers
Support for multiple parallel programming paradigms
  Multi-threading, message passing, mixed-mode, hybrid
Support for performance mapping
Support for object-oriented and generic programming
Integration in complex software systems and applications


How To Use TAU?

Instrumentation
  Application code and libraries
  Selective instrumentation
Install, compile, and link with TAU measurement library
  % configure; make clean install
  Multiple configurations for different measurement options
  Does not require change in instrumentation
  Selective measurement control
Execute “experiments” to produce performance data
  Performance data generated at end or during execution
Use analysis tools to look at performance results


TAU Instrumentation Approach

Support for standard program events
  Routines
  Classes and templates
  Statement-level blocks
Support for user-defined events
  Begin/End events (“user-defined timers”)
  Atomic events
  Selection of event statistics
Support definition of “semantic” entities for mapping
Support for event groups
Instrumentation optimization


TAU Instrumentation

Flexible instrumentation mechanisms at multiple levels
  Source code
    manual
    automatic
      C, C++, F77/90 (Program Database Toolkit (PDT))
      OpenMP (directive rewriting (Opari))
  Object code
    pre-instrumented libraries (e.g., MPI using PMPI)
    statically-linked and dynamically-linked
    fast breakpoints (compiler generated)
  Executable code
    dynamic instrumentation (pre-execution) (DynInstAPI)
    virtual machine instrumentation (e.g., Java using JVMPI)


Multi-Level Instrumentation

Targets common measurement interface
  TAU API
Multiple instrumentation interfaces
  Simultaneously active
Information sharing between interfaces
  Utilizes instrumentation knowledge between levels
Selective instrumentation
  Available at each level
  Cross-level selection
Targets a common performance model
Presents a unified view of execution
  Consistent performance events


Program Database Toolkit (PDT)

Program code analysis framework
  develop source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing, database creation, and database query
  Commercial grade front-end parsers
  Portable IL analyzer, database format, and access API
  Open software approach for tool development
Multiple source languages
Implement automatic performance instrumentation tools
  tau_instrumentor


PDT Architecture and Tools

[Figure: PDT architecture. Application / library source passes through the C/C++ parser or the Fortran 77/90 parser to IL, then through the matching IL analyzer into program database files. DUCTAPE provides access for the tools PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90 interoperability), and TAU_instr (automatic source instrumentation).]


PDT Components

Language front end
  Edison Design Group (EDG): C, C++, Java
  Mutek Solutions Ltd.: F77, F90
IL Analyzer
  Processes intermediate language (IL) tree from front-end
  Creates “program database” (PDB) formatted file
DUCTAPE (Bernd Mohr, FZJ/ZAM, Germany)
  C++ program Database Utilities and Conversion Tools APplication Environment
  Processes and merges PDB files
  C++ library to access the PDB for PDT applications


Instrumentation Control

Selection of which performance events to observe
  Could depend on scope, type, level of interest
  Could depend on instrumentation overhead
How is selection supported in instrumentation system?
  No choice
  Include / exclude lists (TAU)
  Environment variables
  Static vs. dynamic
Controlling the instrumentation of small routines
  High relative measurement overhead
  Significant intrusion and possible perturbation


Selective Instrumentation

% tau_instrumentor

Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file> ]

For selective instrumentation, use -f option

% cat selective.dat

# Selective instrumentation: Specify an exclude/include list.

BEGIN_EXCLUDE_LIST

void quicksort(int *, int, int)

void sort_5elements(int *)

void interchange(int *, int *)

END_EXCLUDE_LIST

# If an include list is specified, the routines in the list will be the only

# routines that are instrumented.

# To specify an include list (a list of routines that will be instrumented)

# remove the leading # to uncomment the following lines

#BEGIN_INCLUDE_LIST

#int main(int, char **)

#int select_

#END_INCLUDE_LIST


Overhead Analysis for Automatic Selection

Analyze the performance data to determine events with high (relative) overhead performance measurements
Create a select list for excluding those events
Rule grammar (used in tau_reduce tool)
  [GroupName:] Field Operator Number
    GroupName indicates rule applies to events in group
    Field is an event metric attribute (from profile statistics)
      numcalls, numsubs, percent, usec, cumusec, count, totalcount, stdev, usecs/call, counts/call
    Operator is one of >, <, or =
    Number is any number
    Compound rules possible using “&” between simple rules


Example Rules

#Exclude all events that are members of TAU_USER
#and use less than 1000 microseconds
TAU_USER:usec < 1000

#Exclude all events that have less than 100
#microseconds and are called only once
usec < 100 & numcalls = 1

#Exclude all events that have less than 1000 usecs per
#call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5

Scientific notation can be used
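The rule grammar can be exercised with a small evaluator. The following is a sketch written for this tutorial text, not the tau_reduce implementation: all type and function names are hypothetical, each rule line is treated as an alternative (an event is excluded if any line matches), and a group prefix is assumed to apply only to the simple rule it is written on.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical summary of one profiled event.
struct EventStats {
    std::string group;                     // e.g. "TAU_USER"
    std::map<std::string, double> fields;  // e.g. "usec", "numcalls"
};

// One "[GroupName:] Field Operator Number" rule.
struct SimpleRule {
    std::string group;  // empty means the rule applies to every group
    std::string field;
    char op;
    double number;
};

static std::string strip(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

SimpleRule parse_simple(const std::string& text) {
    SimpleRule rule{};
    std::string body = text;
    const auto colon = body.find(':');
    if (colon != std::string::npos) {
        rule.group = strip(body.substr(0, colon));
        body = body.substr(colon + 1);
    }
    const auto pos = body.find_first_of("<>=");
    if (pos == std::string::npos) throw std::invalid_argument("missing operator");
    rule.field = strip(body.substr(0, pos));
    rule.op = body[pos];
    rule.number = std::stod(body.substr(pos + 1));  // accepts 1000, 1e3, ...
    return rule;
}

// A compound rule is a list of simple rules joined by '&'.
std::vector<SimpleRule> parse_rule(const std::string& line) {
    std::vector<SimpleRule> parts;
    std::istringstream in(line);
    std::string piece;
    while (std::getline(in, piece, '&')) parts.push_back(parse_simple(piece));
    return parts;
}

bool matches(const SimpleRule& rule, const EventStats& event) {
    if (!rule.group.empty() && rule.group != event.group) return false;
    const auto it = event.fields.find(rule.field);
    if (it == event.fields.end()) return false;
    switch (rule.op) {
        case '<': return it->second < rule.number;
        case '>': return it->second > rule.number;
        default:  return it->second == rule.number;
    }
}

// Excluded if every simple rule of at least one compound rule matches.
bool excluded(const std::vector<std::vector<SimpleRule>>& rules, const EventStats& event) {
    for (const auto& compound : rules) {
        bool all = !compound.empty();
        for (const auto& simple : compound) all = all && matches(simple, event);
        if (all) return true;
    }
    return false;
}
```

Parsing numbers with `std::stod` is what lets the sketch accept scientific notation without extra code.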


TAU Measurement

Performance information
  Performance events
  High-resolution timer library (real-time / virtual clocks)
  General software counter library (user-defined events)
  Hardware performance counters
    PCL (Performance Counter Library) (ZAM, Germany)
    PAPI (Performance API) (UTK, Ptools Consortium)
    consistent, portable API
Organization
  Node, context, thread levels
  Profile groups for collective events (runtime selective)
  Performance data mapping between software levels


TAU Measurement Options

Parallel profiling
  Function-level, block-level, statement-level
  Supports user-defined events
  TAU parallel profile data stored during execution
  Hardware counter values
  Support for multiple counters
  Support for callpath profiling
Tracing
  All profile-level events
  Inter-process communication events
  Timestamp synchronization
  Trace merging and format conversion


TAU Measurement System Configuration

configure [OPTIONS]
  {-c++=<CC>, -cc=<cc>}           Specify C++ and C compilers
  {-pthread, -sproc, -smarts}     Use pthread, SGI sproc, smarts threads
  -openmp                         Use OpenMP threads
  -opari=<dir>                    Specify location of Opari OpenMP tool
  {-papi, -pcl}=<dir>             Specify location of PAPI or PCL
  -pdt=<dir>                      Specify location of PDT
  {-mpiinc=<d>, -mpilib=<d>}      Specify MPI library instrumentation
  -TRACE                          Generate TAU event traces
  -PROFILE                        Generate TAU profiles
  -PROFILECALLPATH                Generate Callpath profiles (1-level)
  -MULTIPLECOUNTERS               Use more than one hardware counter
  -CPUTIME                        Use usertime+system time
  -PAPIWALLCLOCK                  Use PAPI to access wallclock time
  -PAPIVIRTUAL                    Use PAPI for virtual (user) time
  …


TAU Measurement API

Initialization and runtime configuration
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(myNode);
  TAU_PROFILE_SET_CONTEXT(myContext);
  TAU_PROFILE_EXIT(message);
  TAU_REGISTER_THREAD();
Function and class methods
  TAU_PROFILE(name, type, group);
Template
  TAU_TYPE_STRING(variable, type);
  TAU_PROFILE(name, type, group);
  CT(variable);
User-defined timing
  TAU_PROFILE_TIMER(timer, name, type, group);
  TAU_PROFILE_START(timer);
  TAU_PROFILE_STOP(timer);
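A routine instrumented by hand with the macros listed above might look like the sketch below. The routine `sum_of_squares` and the timer name are made up for illustration, and the header name `TAU.h` is an assumption about the installation. So that the example also compiles on a machine without TAU, the macros fall back to empty definitions; in a real build they come from the TAU header and `TAU_PROFILE_INIT(argc, argv)` would normally be called from `main`.

```cpp
// Use the real TAU header when it is available (assumed to be named TAU.h).
#if defined(__has_include)
#  if __has_include(<TAU.h>)
#    include <TAU.h>
#    define HAVE_TAU_HEADER 1
#  endif
#endif

#ifndef HAVE_TAU_HEADER
// Fallback for illustration only: the macros expand to nothing,
// so the instrumented code behaves like the uninstrumented code.
#define TAU_PROFILE(name, type, group)
#define TAU_PROFILE_TIMER(timer, name, type, group)
#define TAU_PROFILE_START(timer)
#define TAU_PROFILE_STOP(timer)
#endif

// Hypothetical routine: profiled as a whole with TAU_PROFILE, with an
// additional user-defined timer around the loop.
long sum_of_squares(int n) {
    TAU_PROFILE("sum_of_squares", "long (int)", TAU_USER);
    TAU_PROFILE_TIMER(loop_timer, "accumulation loop", "", TAU_USER);

    long total = 0;
    TAU_PROFILE_START(loop_timer);
    for (int i = 1; i <= n; ++i) {
        total += static_cast<long>(i) * i;
    }
    TAU_PROFILE_STOP(loop_timer);
    return total;
}
```

The routine-level macro needs no matching stop call, while the user-defined timer is started and stopped explicitly, which is why it suits regions smaller than a routine.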


TAU Measurement API (continued)

User-defined events
  TAU_REGISTER_EVENT(variable, event_name);
  TAU_EVENT(variable, value);
  TAU_PROFILE_STMT(statement);
Mapping
  TAU_MAPPING(statement, key);
  TAU_MAPPING_OBJECT(funcIdVar);
  TAU_MAPPING_LINK(funcIdVar, key);
  TAU_MAPPING_PROFILE(funcIdVar);
  TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar);
  TAU_MAPPING_PROFILE_START(timer);
  TAU_MAPPING_PROFILE_STOP(timer);
Reporting
  TAU_REPORT_STATISTICS();
  TAU_REPORT_THREAD_STATISTICS();
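A user-defined (atomic) event records a value each time it is triggered rather than a begin/end interval. The sketch below shows one plausible way to keep running statistics for such an event. It is not TAU's implementation: `AtomicEventStats` is a hypothetical class, and the particular statistics kept (count, minimum, maximum, mean, population standard deviation) are an assumption about what is useful to report.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical accumulator for the values passed to an atomic event.
class AtomicEventStats {
public:
    // Record one triggered value, e.g. a message size or an allocation size.
    void trigger(double value) {
        ++count_;
        sum_ += value;
        sum_sq_ += value * value;
        min_ = std::min(min_, value);
        max_ = std::max(max_, value);
    }

    long count() const { return count_; }
    double min() const { return min_; }
    double max() const { return max_; }
    double mean() const { return count_ > 0 ? sum_ / count_ : 0.0; }

    // Population standard deviation from the running sums.
    double stddev() const {
        if (count_ == 0) return 0.0;
        const double m = mean();
        return std::sqrt(std::max(0.0, sum_sq_ / count_ - m * m));
    }

private:
    long count_ = 0;
    double sum_ = 0.0;
    double sum_sq_ = 0.0;
    double min_ = std::numeric_limits<double>::infinity();
    double max_ = -std::numeric_limits<double>::infinity();
};
```

Only running sums are stored, so memory use stays constant no matter how many times the event fires.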


Grouping Performance Data in TAU

Profile groups
  A group of related routines forms a profile group
  Statically defined
    TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, …
  Dynamically defined
    group name based on string, such as “adlib” or “particles”
    runtime lookup in a map to get unique group identifier
    uses tau_instrumentor to instrument
Ability to change group names at runtime
Group-based instrumentation and measurement control
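The runtime lookup of a dynamically defined group name can be pictured as a map from strings to identifiers. The sketch below is illustrative only: `GroupRegistry` is a hypothetical class, and representing each group as one bit of a 64-bit mask is an assumption made to keep enable/disable checks cheap, not a statement about how TAU stores groups.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical registry that hands out one bit per group name.
class GroupRegistry {
public:
    // Return the identifier for a name, assigning a fresh bit on first use.
    std::uint64_t id_for(const std::string& name) {
        const auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        if (ids_.size() >= 64) throw std::length_error("too many groups");
        const std::uint64_t id = std::uint64_t{1} << ids_.size();
        ids_.emplace(name, id);
        return id;
    }

    void enable(const std::string& name)  { enabled_ |= id_for(name); }
    void disable(const std::string& name) { enabled_ &= ~id_for(name); }

    // Measurement code would test this before recording an event.
    bool is_enabled(const std::string& name) {
        return (enabled_ & id_for(name)) != 0;
    }

private:
    std::map<std::string, std::uint64_t> ids_;
    std::uint64_t enabled_ = 0;  // all groups start disabled in this sketch
};
```

Because the string lookup happens once per name and the result is a bit, later checks reduce to a single mask test.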


TAU Group Instrumentation Control API

Enabling profile groups
  TAU_ENABLE_INSTRUMENTATION();
  TAU_ENABLE_GROUP(TAU_GROUP);
  TAU_ENABLE_GROUP_NAME("group name");
  TAU_ENABLE_ALL_GROUPS();
Disabling profile groups
  TAU_DISABLE_INSTRUMENTATION();
  TAU_DISABLE_GROUP(TAU_GROUP);
  TAU_DISABLE_GROUP_NAME("group name");
  TAU_DISABLE_ALL_GROUPS();
Obtaining profile group identifier
Runtime switching of profile groups


TAU Pre-execution Control

Dynamic groups defined at file scope
Group names and group associations runtime modifiable
Controlling groups at pre-execution time
  --profile <group1+group2+…+groupN> option

% tau_instrumentor app.pdb app.cpp \
    -o app.i.cpp -g "particles"

% mpirun -np 4 application \
    --profile particles+field+mesh+io

Examples:
  POOMA (LANL) uses static groups
  VTF (Caltech) uses dynamic group in Python-based execution instrumentation control


Configuring TAU Measurement Library

Profiling with wallclock time (on a quad PIII Linux machine)
  % configure -mpiinc=/usr/local/packages/mpich/include \
      -mpilib=/usr/local/packages/mpich/lib \
      -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 -LINUXTIMERS

Tracing
  % configure -mpiinc=/usr/local/packages/mpich/include \
      -mpilib=/usr/local/packages/mpich/lib \
      -pdt=/usr/pkg/pdtoolkit -useropt=-O2 -LINUXTIMERS

Profiling with PAPI
  % configure -mpiinc=/usr/local/packages/mpich/include \
      -mpilib=/usr/local/packages/mpich/lib \
      -pdt=/usr/pkg/pdtoolkit/ -useropt=-O2 \
      -papi=/usr/local/packages/papi
  % setenv PAPI_EVENT PAPI_FP_INS
  % setenv PAPI_EVENT PAPI_L1_DCM


TAU Analysis

Parallel profile analysis
  Pprof
    parallel profiler with text-based display
  Racy
    graphical interface to pprof (Tcl/Tk)
  jRacy
    Java implementation of Racy with enhancements
Trace analysis and visualization
  Trace merging and clock adjustment (if necessary)
  Trace format conversion (ALOG, SDDF, VTF, Paraver)
  Trace visualization using Vampir (Pallas)


Pprof Command

pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]
  -c       Sort according to number of calls
  -b       Sort according to number of subroutines called
  -m       Sort according to msecs (exclusive time total)
  -t       Sort according to total msecs (inclusive time total)
  -e       Sort according to exclusive time per call
  -i       Sort according to inclusive time per call
  -v       Sort according to standard deviation (exclusive usec)
  -r       Reverse sorting order
  -s       Print only summary profile information
  -n num   Print only first number of functions
  -f file  Specify full path and filename without node ids
  -l       List all functions and exit
  nodes    (prints only info about all contexts/threads of given node numbers)


Pprof Output (NAS Parallel Benchmark – LU)

[Figure: pprof text output for the LU benchmark on an Intel quad PIII Xeon, F90 + MPICH. Annotations: one profile per node / context / thread; events cover application code and MPI.]


jRacy (NAS Parallel Benchmark – LU)

[Figure: jRacy windows. Annotations: global profiles (n: node, c: context, t: thread); routine profile across all nodes; event legend; individual profile.]


TAU + PAPI (NAS Parallel Benchmark – LU )

Floating point operations

Re-link to alternate library

Can use multiple counter support


TAU + Vampir (NAS Parallel Benchmark – LU)

[Figure: Vampir windows: timeline display, callgraph display, parallelism display, communications display]


tau_reduce Example

tau_reduce implements overhead reduction in TAU
Consider klargest example
  Find kth largest element in N elements
  Compare two methods: quicksort, select_kth_largest
  “Un-instrumented” testcase: i = 2324, N = 1000000
    quicksort: (wall clock) = 0.188511 secs
    select_kth_largest: (wall clock) = 0.149594 secs
    Total: (PIII/1.2GHz time) = 0.340u 0.020s 0:00.37
Execute with all routines instrumented
Execute with rule-based selective instrumentation
  usec>1000 & numcalls>400000 & usecs/call<30 & percent>25


Simple sorting example on one processor

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive        #Call       #Subrs  Inclusive Name
              msec         msec                            usec/call
---------------------------------------------------------------------------------------
100.0           13        4,982            1            4    4982030 int main
 93.5        3,223        4,659  4.20241E+06  1.40268E+07          1 void quicksort
 62.9      0.00481        3,134            5            5     626839 int kth_largest_qs
 36.4          137        1,813           28       450057      64769 int select_kth_largest
 33.6          150        1,675       449978       449978          4 void sort_5elements
 28.8        1,435        1,435  1.02744E+07            0          0 void interchange
  0.4           20           20            1            0      20668 void setup
  0.0       0.0118       0.0118           49            0          0 int ceil

Before selective instrumentation reduction

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive        #Call       #Subrs  Inclusive Name
              msec   total msec                            usec/call
---------------------------------------------------------------------------------------
100.0           14          383            1            4     383333 int main
 50.9          195          195            5            0      39017 int kth_largest_qs
 40.0          153          153           28           79       5478 int select_kth_largest
  5.4           20           20            1            0      20611 void setup
  0.0         0.02         0.02           49            0          0 int ceil

After selective instrumentation reduction
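The effect of the reduction can be quantified as a slowdown factor relative to the un-instrumented run. The helper below is a small worked-arithmetic sketch; `slowdown` is a hypothetical name, and comparing the inclusive time of `main` (4,982 msec before, 383 msec after) against the 0.37 s elapsed time of the un-instrumented test case is an approximation, since the two figures are not measured in exactly the same way.

```cpp
#include <stdexcept>

// Ratio of an instrumented run time to the un-instrumented baseline.
// A value of 1.0 would mean no measurable overhead.
double slowdown(double instrumented_sec, double baseline_sec) {
    if (baseline_sec <= 0.0) throw std::invalid_argument("baseline must be positive");
    return instrumented_sec / baseline_sec;
}
```

With the numbers from the listings, `slowdown(4.982, 0.37)` is a little over 13, while `slowdown(0.383, 0.37)` is within a few percent of 1, which is the point of excluding the small, frequently called routines.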


TAU Performance System Status

Computing platforms
  IBM SP / Power4, SGI Origin 2K/3K, ASCI Red, Cray T3E / SV-1 (X-1 planned), HP (Compaq) SC (Tru64), HP Superdome (HP-UX), Sun, Hitachi SR8000, NEC SX-5 (SX-6 underway), Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power), Apple (OS X), Windows
Programming languages
  C, C++, Fortran 77, F90, HPF, Java, OpenMP, Python
Communication libraries
  MPI, PVM, Nexus, shmem, Tulip, ACLMPL, MPIJava
Thread libraries
  pthreads, SGI sproc, Java, Windows, OpenMP, SMARTS


TAU Performance System Status (continued)

Compilers
  Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel
Application libraries (selected)
  Blitz++, A++/P++, PETSc, SAMRAI, Overture, PAWS
Application frameworks (selected)
  POOMA, MC++, Conejo, Uintah, VTF, UPS, GrACE
Performance projects using TAU
  Aurora / SCALEA: ACPC, University of Vienna
TAU full distribution (Version 2.12, web download)
  TAU performance system toolkit and user’s guide
  Automatic software installation and examples


PDT Status

Program Database Toolkit (Version 2.2, web download)
  EDG C++ front end (Version 2.45.2)
  Mutek Fortran 90 front end (Version 2.4.1)
  C++ and Fortran 90 IL Analyzer
  DUCTAPE library
  Standard C++ system header files (KCC Version 4.0f)
PDT-constructed tools
  TAU instrumentor (C/C++/F90)
  Program analysis support for SILOON and CHASM
Platforms
  Same as for TAU with a few exceptions


Performance Mapping

Associate performance measurements with high-level semantic abstractions
Performance mapping: performance measurement system support to assign data correctly


Semantic Entities/Attributes/Associations

New dynamic mapping scheme (SEAA)
  Contrast with ParaMap (Miller and Irvin)
  Entities defined at any level of abstraction
  Attribute entity with semantic information
  Entity-to-entity associations
Two association types (implemented in TAU API)
  Embedded: extends associated object to store performance measurement entity
  External: creates an external look-up table using address of object as key to locate performance measurement entity
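The two association types can be contrasted in a few lines of C++. This is a sketch of the idea only, not the TAU mapping API: `PerfEntity`, `ParticleEmbedded`, `Particle`, and `ExternalAssociations` are hypothetical names, and the "performance measurement entity" is reduced to a call count and an accumulated time.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical performance measurement entity for one semantic entity,
// for example "particles on face 3".
struct PerfEntity {
    explicit PerfEntity(std::string n) : name(std::move(n)) {}
    void record(double usec) { ++calls; total_usec += usec; }

    std::string name;
    long calls = 0;
    double total_usec = 0.0;
};

// Embedded association: the application object is extended with a
// pointer to its measurement entity, so lookup is a field access.
struct ParticleEmbedded {
    int face;
    PerfEntity* perf;
};

// External association: the application object is left untouched and a
// table keyed by the object's address locates the measurement entity.
struct Particle {
    int face;
};

class ExternalAssociations {
public:
    void link(const void* object, PerfEntity* entity) { table_[object] = entity; }

    PerfEntity* lookup(const void* object) const {
        const auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<const void*, PerfEntity*> table_;
};
```

The embedded form is faster but requires changing the object's type; the external form costs a hash lookup per access but works for objects whose definition cannot be modified.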


Hypothetical Mapping Example

Particles distributed on surfaces of a cube

Particle* P[MAX]; /* Array of particles */

int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  int last = 0;
  for (int face = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last += particles_on_this_face;
  }
  return last; /* total number of particles generated */
}


Hypothetical Mapping Example (continued)

How much time is spent processing face i particles?
What is the distribution of performance among faces?

int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}

int main() {
  GenerateParticles();
  /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}

[Figure: an engine consuming work packets]


No Performance Mapping versus Mapping

Typical performance tools report performance with respect to routines
  Does not provide support for mapping
Performance tools with SEAA mapping can observe performance with respect to scientist’s programming and problem abstractions

[Figure: profile displays comparing TAU (no mapping) with TAU (w/ mapping)]


Strategies for Empirical Performance Evaluation

Empirical performance evaluation as a series of performance experiments
  Experiment trials describing instrumentation and measurement requirements
  Where/When/How axes of empirical performance space
    where are performance measurements made in program
    when is performance instrumentation done
    how are performance measurement/instrumentation chosen
Strategies for achieving flexibility and portability goals
  Limited performance methods restrict evaluation scope
  Non-portable methods force use of different techniques
  Integration and combination of strategies


Case Study: SIMPLE Performance Analysis

SIMPLE hydrodynamics benchmark
  C code with MPI message communication
Multiple instrumentation methods
  source-to-source translation (PDT)
  MPI wrapper library level instrumentation (PMPI)
  pre-execution binary instrumentation (DyninstAPI)
Alternative measurement strategies
  statistical profiles of software actions
  statistical profiles of hardware actions (PCL, PAPI)
  program event tracing
  choice of time source
    gettimeofday, high-res physical, CPU, process virtual
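The choice of time source changes what a measurement means. The sketch below contrasts two of the listed sources on a POSIX system: wall-clock time from gettimeofday and process CPU time from std::clock. The helper names (WallSeconds, CpuSeconds, MeasureSleep) are hypothetical and this is not how TAU selects its clock; it only illustrates why the two readings differ for a region that waits rather than computes.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <sys/time.h>  // gettimeofday (POSIX)

// Wall-clock seconds since the epoch, from gettimeofday.
double WallSeconds() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return static_cast<double>(tv.tv_sec) +
           static_cast<double>(tv.tv_usec) * 1e-6;
}

// CPU seconds consumed by this process so far.
double CpuSeconds() {
    return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
}

struct Sample {
    double wall;  // elapsed wall-clock seconds
    double cpu;   // elapsed process CPU seconds
};

// Measure a region that only sleeps: wall time advances by roughly the
// sleep duration, while CPU time barely moves.
Sample MeasureSleep(int milliseconds) {
    assert(milliseconds >= 0);
    const double w0 = WallSeconds();
    const double c0 = CpuSeconds();
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
    const double w1 = WallSeconds();
    const double c1 = CpuSeconds();
    return Sample{w1 - w0, c1 - c0};
}
```

A routine blocked in communication looks expensive under the wall-clock source and nearly free under the CPU source, so the two profiles answer different questions about the same run.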


SIMPLE Source Instrumentation (Preprocessed)

int compute_heat_conduction(double theta_hat[X][Y], double deltat,
    double new_r[X][Y], double new_z[X][Y], double new_alpha[X][Y],
    double new_rho[X][Y], double theta_l[X][Y],
    double Gamma_k[X][Y], double Gamma_l[X][Y])
{
  TAU_PROFILE("int compute_heat_conduction(double (*)[259], double, double (*)[259],double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259], double (*)[259])"," ", TAU_USER);
  ...
}

PDT automatically generates instrumentation code
  names events with full function signatures
Similarly for all other routines in SIMPLE program


MPI Library Instrumentation (MPI_Send)

int MPI_Send(…)
...
{
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}

Uses MPI profiling interposition library (PMPI)


MPI Library Instrumentation (MPI_Recv)

int MPI_Recv(…)
...
{
  int returnVal, size;
  TAU_PROFILE_TIMER(tautimer, "MPI_Recv()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  returnVal = PMPI_Recv(buf, count, datatype, src, tag, comm, status);
  if (src != MPI_PROC_NULL && returnVal == MPI_SUCCESS) {
    PMPI_Get_count( status, MPI_BYTE, &size );
    TAU_TRACE_RECVMSG(status->MPI_TAG, status->MPI_SOURCE, size);
  }
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}


Multi-Level Instrumentation (Profiling)

[Figure labels: four processes; profile per process; global profile; event legend]


Multi-Level Instrumentation (Tracing)

Relink with TAU library configured for tracing
No modification of source instrumentation required!

TAU performance groups


Dynamic Instrumentation of SIMPLE

Uses DyninstAPI for runtime code patching
Mutator loads measurement library, instruments mutatee
One mutator (tau_run) per executable image
  mpirun -np <n> tau.shell


Case Study: PETSc v2.1.3 (ANL)

Portable, Extensible Toolkit for Scientific Computation
Scalable (parallel) PDE framework
  Suite of data structures and routines (374,458 code lines)
  Solution of scientific applications modeled by PDEs
Parallel implementation
  MPI used for inter-process communication
TAU instrumentation
  PDT for C/C++ source instrumentation (100%, no manual)
  MPI wrapper interposition library instrumentation
Example
  Linear system of equations (Ax=b) (SLES) (ex2 test case)
  Non-linear system of equations (SNES) (ex19 test case)


PETSc ex2 (Profile - wallclock time)

Sorted with respect to exclusive time


PETSc ex2 (Profile - overall and message counts)

Observe load balance

Track messages

Capture with user-defined events


PETSc ex2 (Profile - percentages and time)

View per-thread performance on individual routines


PETSc ex2 (Trace)


PETSc ex19

Non-linear solver (SNES)
  2-D driven cavity code
  Uses velocity-vorticity formulation
  Finite difference discretization on a structured grid
Problem size and measurements
  56x56 mesh size on quad Pentium III (550 MHz, Linux)
  Executes for approximately one minute
  MPI wrapper interposition library
  PDT (tau_instrumentor)
  Selective instrumentation (tau_reduce)
    three routines identified with high instrumentation overhead


PETSc ex19 (Profile - wallclock time)

Sorted by inclusive time

Sorted by exclusive time
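The two sort orders rest on the distinction between inclusive time (a routine plus everything it calls) and exclusive time (the routine alone). A minimal sketch of how a profiler can derive both from enter/exit events follows. MiniProfiler and DemoProfile are hypothetical names, the clock is a caller-supplied tick count so the example is deterministic, and recursion is not handled; this is not TAU's implementation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct RoutineStats {
    long inclusive = 0;  // time in the routine including its callees
    long exclusive = 0;  // time in the routine excluding its callees
    long calls = 0;
};

class MiniProfiler {
public:
    // `now` is a caller-supplied tick count (hypothetical clock).
    void Enter(const std::string& name, long now) {
        stack_.push_back(Frame{name, now, 0});
    }

    void Exit(long now) {
        assert(!stack_.empty());
        Frame f = stack_.back();
        stack_.pop_back();
        const long inclusive = now - f.start;
        RoutineStats& s = stats_[f.name];
        s.calls += 1;
        s.inclusive += inclusive;
        s.exclusive += inclusive - f.child_time;
        // Charge this call's inclusive time to the caller as child time.
        if (!stack_.empty()) {
            stack_.back().child_time += inclusive;
        }
    }

    const RoutineStats& Get(const std::string& name) { return stats_[name]; }

private:
    struct Frame {
        std::string name;
        long start;
        long child_time;
    };
    std::vector<Frame> stack_;
    std::map<std::string, RoutineStats> stats_;
};

// main runs from tick 0 to 10 and calls solve from tick 2 to 7.
MiniProfiler DemoProfile() {
    MiniProfiler p;
    p.Enter("main", 0);
    p.Enter("solve", 2);
    p.Exit(7);   // solve: inclusive 5, exclusive 5
    p.Exit(10);  // main: inclusive 10, exclusive 10 - 5 = 5
    return p;
}
```

Sorting by inclusive time puts main first, since it contains everything. Sorting by exclusive time ranks routines by the work done in their own bodies, which is usually the more useful view when hunting for hot spots.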


PETSc ex19 (Profile - overall and percentages)


PETSc ex19 (Tracing)

Commonly seen communication behavior


PETSc ex19 (Tracing - callgraph)


PETSc ex19 (PAPI_FP_INS, PAPI_L1_DCM)

PAPI_FP_INS

PAPI_L1_DCM

Uses multiple counter profile measurement


Case Study: Mixed-mode Parallel Programs

Portable mixed-mode parallel programming
  Multi-threaded shared memory programming
  Inter-node message passing
Performance measurement
  Access to runtime system and communication events
  Associate communication and application events
2-Dimensional Stommel model of ocean circulation
  OpenMP for shared memory parallel programming
  MPI for cross-box message-based parallelism
  Jacobi iteration, 5-point stencil
  Timothy Kaiser (San Diego Supercomputer Center)


Stommel Instrumentation

OpenMP directive instrumentation (uses OPARI)

pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
for( i=i1;i<=i2;i++) {
  for(j=j1;j<=j2;j++){
    new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                  + a4*psi[i][j-1] - a5*the_for[i][j];
    diff=diff+fabs(new_psi[i][j]-psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"


OpenMP + MPI Ocean Modeling (Trace)

Thread-paired message passing

Integrated OpenMP + MPI events


OpenMP + MPI Ocean Modeling (HW Profile)

% configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/lib

Integrated OpenMP + MPI events

FP instructions


Case Study: C-SAFE / Uintah

Center for Simulation of Accidental Fires & Explosions
  ASCI ASAP Level 1 center, University of Utah
  PSE for multi-model simulation of high-energy explosions
  Coupled non-linear solvers, optimization, computational steering, visualization, and experimental data verification
  Very large-scale simulations
Computer science problems:
  Coupling of multiple simulation codes
  Software engineering across diverse expert teams
  Achieving high performance on large-scale systems


Example C-SAFE Simulation Problems

Heptane fire simulation

Material stress simulation

Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics


Uintah Computational Framework (UCF)

Execution model based on software (macro) dataflow
  Exposes parallelism and hides data transport latency
Computations expressed as directed acyclic graphs of tasks
  each task consumes input and produces output (input to future task)
  input/outputs specified for each patch in a structured grid
Abstraction of global single-assignment memory
  DataWarehouse
  Directory mapping names to values (array structured)
  Write value once then communicate to awaiting tasks
Task graph gets mapped to processing resources
  Communications schedule approximates global optimal


Performance Technology Integration

Uintah presents challenges to performance integration
  Software diversity and structure
    UCF middleware, simulation code modules
    component-based hierarchy
  Portability objectives
    cross-language and cross-platform
    multi-parallelism: thread, message passing, mixed
  Scalability objectives
  High-level programming and execution abstractions
Requires flexible and robust performance technology
Requires support for performance mapping


Task Execution in Uintah Parallel Scheduler

Profile methods and functions in scheduler and in MPI library

Need to map performance data!

Task execution time dominates (what task?)

MPI communication overheads (where?)

Task execution time distribution


Uintah Task Performance Mapping

Uintah partitions individual particles across processing elements (processes or threads)
Simulation tasks in task graph work on particles
  Tasks have domain-specific character in the computation
    “interpolate particles to grid” in Material Point Method
  Task instances generated for each partitioned particle set
  Execution scheduled with respect to task dependencies
How to attribute execution time among different tasks
  Assign semantic name (task type) to a task instance
    SerialMPM::interpolateParticleToGrid
  Map TAU timer object to (abstract) task (semantic entity)
  Look up timer object using task type (semantic attribute)
  Further partition along different domain-specific axes
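The look-up step above (timer object found by task type) can be sketched as a registry keyed by the semantic name. This is an illustration under assumptions, not Uintah or TAU code: TaskTimerRegistry, TaskInstance and DemoRegistry are hypothetical, the task type "SerialMPM::otherTask" is invented for the demo, and per-instance times are passed in as plain numbers instead of being measured.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Accumulated measurements for one task type (the semantic entity).
struct TaskTimer {
    long instances = 0;
    double seconds = 0.0;
};

// One scheduled task instance, as a generic scheduler might see it.
struct TaskInstance {
    std::string type;  // semantic attribute used as the look-up key
    int patch;         // which partition of the particle set it worked on
    double seconds;    // execution time of this instance (supplied, not measured)
};

class TaskTimerRegistry {
public:
    // Look up (or create on first use) the timer for a task type.
    TaskTimer& Lookup(const std::string& task_type) {
        return timers_[task_type];
    }

    // Charge one executed instance to the timer of its task type.
    void Record(const TaskInstance& t) {
        TaskTimer& timer = Lookup(t.type);
        timer.instances += 1;
        timer.seconds += t.seconds;
    }

private:
    std::map<std::string, TaskTimer> timers_;
};

// Demo: three instances across two task types.
// "SerialMPM::otherTask" is a made-up name for illustration only.
TaskTimerRegistry DemoRegistry() {
    const std::vector<TaskInstance> executed = {
        {"SerialMPM::interpolateParticleToGrid", 0, 0.5},
        {"SerialMPM::otherTask", 0, 0.125},
        {"SerialMPM::interpolateParticleToGrid", 1, 0.25},
    };
    TaskTimerRegistry registry;
    for (const TaskInstance& t : executed) {
        assert(t.seconds >= 0.0);
        registry.Record(t);
    }
    return registry;
}
```

The scheduler executes every instance through the same generic routine, so a routine-level profile shows one large entry. Keying the timer on task type splits that entry into per-task totals, and adding the patch index to the key would partition further along another domain-specific axis.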


Task Performance Mapping (Profile)

Mapped task performance across processes

Performance mapping for different tasks


Work Packet – to – Task Mapping (Trace)

Work packet computation events colored by task type

Distinct phases of computation can be identified based on task


Comparing Uintah Traces for Scalability Analysis

[Figure panels: traces for 8 processes and for 32 processes]


Online Performance Analysis for C-SAFE Apps

[Figure: online performance analysis diagram. Component labels: Application; TAU Performance System; Performance Data Integrator; Performance Data Reader; Performance Analyzer; Performance Visualizer; Performance Steering; SCIRun (Univ. of Utah). Other labels: performance data streams; performance data output; file system; sample sequencing; reader synchronization; accumulated samples]


2D Field Performance Visualization in SCIRun

SCIRun program


Uintah Computational Framework (UCF)

UCF analysis
  Scheduling
  MPI library
  Components
500 processes
Online and offline visualization
Performance steering
  use SCIRun support


Case Study: SAMRAI (LLNL)

Structured Adaptive Mesh Refinement Application Infrastructure (SAMRAI)
Programming
  C++ and MPI
  SPMD
Instrumentation
  PDT for automatic instrumentation of routines
  MPI interposition wrappers
  SAMRAI timers for interesting code segments
    timers classified in groups (apps, mesh, …)
    timer groups are managed by TAU groups


SAMRAI (Profile)

Euler (2D)

return type routine name


SAMRAI Euler (Profile)


SAMRAI Euler (Trace)


Case Study: EVH1

Enhanced Virginia Hydrodynamics #1 (EVH1)
  "TeraScale Simulations of Neutrino-Driven Supernovae and Their Nucleosynthesis" SciDAC project
  Configured to run a simulation of the Sedov-Taylor blast wave solution in 2D spherical geometry
Performance study found EVH1 communication bound for more than 64 processors
  Predominant routine (>50% of execution time) at this scale is MPI_ALLTOALL
  Used in matrix transpose-like operations


EVH1 Execution Profile


EVH1 Execution Trace

MPI_Alltoall is an execution bottleneck


TAU Integration (Selected)

SAMRAI (LLNL)
Overture (LLNL)
C-SAFE (ASCI ASAP)
VTF (ASCI ASAP)
SAGE (ASCI LANL)
POOMA, POOMA-II (LANL, Code Sourcery)
PETSc (ANL)
CCA (DOE SciDAC)
GrACE (Rutgers)
Aurora / SCALEA (University of Vienna)


Work in Progress

Trace visualization
  Event traces with counters (Vampir 3.0 will visualize)
  EPILOG trace conversion
Runtime performance monitoring and analysis
  Online performance data access
  Performance analysis and visualization in SCIRun
Performance Database Framework
  XML parallel profile representation of TAU profiles
  PostgreSQL performance database
Next-generation PDT
Performance analysis for component software (CCA)


Concluding Remarks

Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools

To build more sophisticated performance tools, existing proven performance technology must be utilized

Performance tools must be integrated with software and systems models and technology
  Performance engineered software
  Function consistently and coherently in software and system environments

TAU performance system offers robust performance technology that can be broadly integrated … so USE IT!


Acknowledgements

Department of Energy (DOE)
  MICS office
    DOE 2000 ACTS contract
    “Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System”
    PERC SciDAC project affiliate
  University of Utah DOE ASCI Level 1 sub-contract
  DOE ASCI Level 3 (LANL, LLNL)
NSF National Young Investigator (NYI) award
Research Centre Juelich
  John von Neumann Institute for Computing
  Dr. Bernd Mohr
Los Alamos National Laboratory


Information

TAU (http://www.acl.lanl.gov/tau)
PDT (http://www.acl.lanl.gov/pdtoolkit)
PAPI (http://icl.cs.utk.edu/projects/papi/)
OPARI (http://www.fz-juelich.de/zam/kojak/)