
Chee Wai Lee, Allen D. Malony, Alan Morris
{cheelee,malony,amorris}@cs.uoregon.edu

Department of Computer and Information Science

Performance Research Laboratory

University of Oregon

TAUmon: Scalable Online Performance Data Analysis in TAU


TAUmon: Scalable Online Performance Data Analysis in TAU (PROPER 2010)

Outline

Motivation
Brief review of prior work
TAUmon design and objectives
  Scalable analysis operations
  Transports
    MRNet
    MPI
TAUmon experiments
  Perspectives on understanding applications
  Experiments
  Scaling results
Remarks


Motivation

Performance problem analysis is increasingly complex
  Multi-core, heterogeneous, and extreme-scale computing
  Adaptive algorithms and runtime application tuning
  Performance dynamics variability within/between executions
Neo-performance measurement and analysis perspective
  Static, offline analysis → dynamic, online analysis
  Scalable runtime analysis of parallel performance data
  Performance feedback to application for adaptive control
  Integrated performance monitoring (measurement + query)
  Co-allocation of additional (tool-specific) system resources
Goal
  Scalable, integrated parallel performance monitoring


Parallel Performance Measurement and Data

Parallel performance tools measure locally and concurrently
  Scaling dictates “local” measurements (profile, trace)
    Save data with “local context” (processes or threads)
  Done without synchronization or central control
Parallel performance state is globally distributed as a result
  Logically part of application’s global data space
  Offline: outputs data at execution end for post-mortem analysis
  Online: access to performance state for runtime analysis
Definition: Monitoring
  Online access to parallel performance (data) state
  May or may not involve runtime analysis


Monitoring for Performance Dynamics

Runtime access to parallel performance data
  Scalable and lightweight
  Raises concerns of overhead and intrusion
  Support for performance-adaptive, dynamic applications
Alternative 1: Extend existing performance measurement
  Create own integrated monitoring infrastructure
  Disadvantage: maintain own monitoring framework
Alternative 2: Couple with other monitoring infrastructure
  Leverage scalable middleware from other supported projects
  Challenge: measurement system / monitor integration


Performance Dynamics: Parallel Profile Snapshots

Profile snapshots are parallel profiles recorded at runtime
  Show performance profile dynamics (all types allowed)

[Figure: information versus overhead for profiles, profile snapshots, and traces]

A. Morris, W. Spear, A. Malony, and S. Shende, “Observing Performance Dynamics using Parallel Profile Snapshots,” European Conference on Parallel Processing (EuroPar), 2008.


Parallel Profile Snapshots of FLASH 3.0 (UIC)

Simulation of astrophysical thermonuclear flashes
Snapshots show profile differences since last snapshot
Captures all events since beginning per thread
Mean profile calculated post-mortem
Highlight change in performance per iteration and at checkpointing

[Figure: snapshot display with phases labeled Initialization, Checkpointing, Finalization]


FLASH 3.0 Performance Dynamics (Periodic)

[Figure: periodic performance dynamics of FLASH 3.0; labeled event: INTRFC]


Prior Performance Monitoring Work

TAUoverSupermon (UO, Los Alamos National Laboratory)

TAUg (UO)

TAUoverMRNET (UO, University of Wisconsin, Madison)

A. Nataraj, M. Sottile, A. Morris, A. Malony, and S. Shende, “TAUoverSupermon: Low-overhead Online Parallel Performance Monitoring,” EuroPar, 2007.

A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller, “A Framework for Scalable, Parallel Performance Monitoring using TAU and MRNet,” Concurrency and Computation: Practice and Experience, 22(6):720–735, 2009, special issue on Scalable Tools for High-End Computing.

A. Nataraj, A. Malony, A. Morris, D. Arnold, and B. Miller, “In Search of Sweet-Spots in Parallel Performance Monitoring,” Conference on Cluster Computing (Cluster 2008).

K. Huck, A. Malony, A. Morris, “TAUg: Runtime Global Performance Data Access using MPI,” EuroPVMMPI, 2006.


TAUmon: Design

Design a transport-neutral application monitoring framework
  Base on prior / existing work with various transport systems
    Supermon, MRNet, MPI
  Enable efficient development of monitoring functionality
Objectives
  Scalable access to a running application’s performance
    at end of the application (before parallel teardown)
    while the application is still running
  Support for scalable performance data analysis
    reduction
    statistical evaluation
  Feedback (data, control) to application
  Monitoring engineering and performance efficiency issues


TAUmon: Architecture

[Figure: TAUmon architecture. Labels: MPI process 0 ... MPI process k ... MPI process P-1; threads; TAU profiles; TAUmon; MPI; monitoring infrastructure]


TAUmon: Current Usage

TAU_ONLINE_DUMP() collective operation in application
  Called by all threads / processes (originally to output profiles)
  Arguments specify data analysis operation (future)
Appropriate version of TAU selected for transport system
  TAUmonMRnet: TAUmon using MRNet infrastructure
  TAUmonMPI: TAUmon using MPI infrastructure
User instruments application with TAU support for desired monitoring transport system (temporary)
User submits instrumented application to parallel job system
  Other launch systems must be submitted along with the application to the job scheduler as needed
    different machine-specific job-submission scripts


TAUmon: Parallel Profile Data Analysis

Total parallel profile data size depends on:
  # events * size of event * # execution threads
  Event size depends on # metrics
  Example: 200 events * 100 bytes * 64,000 threads = 1.28 GB
Monitoring operations
  Periodic profile data output (à la profile snapshots)
  Events unification
  Basic statistics: mean, min/max, standard deviation, ...
  Advanced statistics: histogram, clustering, ...

Strong motivation to implement the operations in parallel
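
The size estimate on this slide is a plain product, and a few lines of Python reproduce it. This is only a sketch: the function name `profile_bytes` is hypothetical, and the 100-byte event size is the slide's example value (in practice it depends on the number of metrics).

```python
def profile_bytes(num_events, bytes_per_event, num_threads):
    """Estimate the total parallel profile size in bytes."""
    return num_events * bytes_per_event * num_threads


# The slide's example: 200 events, 100 bytes each, 64,000 threads.
total = profile_bytes(200, 100, 64_000)
print(f"{total / 1e9:.2f} GB")
```

Because the total grows linearly with the thread count, shipping every profile to one place quickly becomes impractical, which is the motivation for performing the operations in parallel.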


Profile Event Unification

TAU creates events for each process individually
  Assigns event identifiers locally
  Same event can have different identifiers on each process
Analysis requires event identifiers to be unified
  Currently done offline
    TAU must output full event information from each process
    Output format stores event names, leading to redundancy
    Inflates the storage requirements (e.g., 1.28 GB → 5 GB)
Implement online parallel event unification
  Two-phase process
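
The slides do not spell out the two phases, so the following is a minimal single-process sketch under an assumption: phase 1 collects the distinct event names from all processes into one sorted global table, and phase 2 gives each process a map from its local identifiers to the global ones. The function name `unify_events` and the list-based stand-in for inter-process communication are hypothetical and not part of TAU.

```python
def unify_events(local_tables):
    """Unify event identifiers across processes.

    local_tables[p][i] is the name of the event with local id i on process p.
    Returns (global_names, local_to_global), where local_to_global[p][i] is
    the global id of process p's local event i.
    """
    # Phase 1 (assumed): merge all names into one sorted, duplicate-free table.
    global_names = sorted(set().union(*[set(t) for t in local_tables]))
    global_id = {name: gid for gid, name in enumerate(global_names)}

    # Phase 2 (assumed): each process translates its local ids to global ids.
    local_to_global = [[global_id[name] for name in table]
                       for table in local_tables]
    return global_names, local_to_global


names, maps = unify_events([["main", "MPI_Send"],
                            ["MPI_Send", "main", "foo"]])
print(names)
print(maps)
```

Once every process shares the same identifiers, event names need to be stored only once instead of once per process, which removes the redundancy noted above.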


Parallel Profile Merging

TAU creates a file for every thread of execution
Profile merging will reduce the number of files generated
  Profiles from each thread are sent to a root process
  Root process concatenates into a single file
Pre-requisite: event unification
Event unification combined with profile merging leads to more compact storage (reduced)
PFLOTRAN example:
  16K cores: 1.5 GB reduced to 300 MB merged
  131K cores: 27 GB reduced to 600 MB merged


Basic Statistics

Mean profile
  Averaged values for all events and metrics across all threads
  Easily created using simple reduction summation operations
Can generate other basic statistics in same way
Parallel statistical reduction of profile events can be very fast
Supports time-series observations

[Figure: FLASH Sod 2D, N=1024, Allreduce; significant events by mean value; sudden spike at iteration 100]
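
As an illustration of how the mean and the other basic statistics can come from one reduction, here is a minimal sketch in plain Python. It assumes event identifiers are already unified so that every thread's profile is a list in the same event order; the names `local_partial`, `combine`, and `finalize` are hypothetical, and `combine` plays the role of the reduction operator applied at each level of a tree.

```python
import math


def local_partial(thread_profiles):
    """Summarize one node's threads as per-event (count, sum, sum_sq, min, max)."""
    n = len(thread_profiles)
    columns = list(zip(*thread_profiles))  # one column of values per event
    return [(n, sum(c), sum(v * v for v in c), min(c), max(c)) for c in columns]


def combine(a, b):
    """Reduction operator: merge two partial summaries event by event."""
    return [(na + nb, sa + sb, qa + qb, min(la, lb), max(ha, hb))
            for (na, sa, qa, la, ha), (nb, sb, qb, lb, hb) in zip(a, b)]


def finalize(partial):
    """Turn the fully reduced summary into mean, min, max, and standard deviation."""
    stats = []
    for n, s, q, lo, hi in partial:
        mean = s / n
        variance = max(q / n - mean * mean, 0.0)  # population variance
        stats.append({"mean": mean, "min": lo, "max": hi,
                      "std": math.sqrt(variance)})
    return stats


node_a = [[2.0, 10.0]]  # one thread, two events
node_b = [[4.0, 10.0]]
print(finalize(combine(local_partial(node_a), local_partial(node_b))))
```

The sum-of-squares form keeps the reduction to a single pass, at the cost of some numerical robustness when values are large and nearly equal.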


Histogramming

Determine distribution of threads per event
  Divide the range of values into a number of bins
  Determine number of threads with event values in each bin
Pre-requisites: min/max values and number of bins
Implementation:
  Broadcast min/max and number of bins to each node
  Node decides which bins to increment based on its own values
  Partial bin increments from each node are summed via reduction tree to the root
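
The implementation steps above can be sketched for a single event as follows. This is a single-process illustration, not TAUmon code: the function names are hypothetical, equal-width bins are an assumption, and a plain element-wise sum stands in for the reduction tree.

```python
def local_bin_counts(values, lo, hi, num_bins):
    """Count this node's values per bin, given the broadcast min/max and bin count."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        # Clamp so that v == hi falls into the last bin.
        counts[min(max(idx, 0), num_bins - 1)] += 1
    return counts


def reduce_counts(partials):
    """Element-wise sum of the per-node partial histograms."""
    return [sum(column) for column in zip(*partials)]


node_values = [[0.0, 1.0], [2.0, 4.0]]  # event values held by two nodes
partials = [local_bin_counts(v, 0.0, 4.0, 4) for v in node_values]
print(reduce_counts(partials))
```

Each node sends only `num_bins` integers regardless of how many threads it hosts, so the message size stays constant as the application scales.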


Histogramming (continued)

Histograms are useful for highlighting changes in thread distribution of a single event over time

[Figure: FLASH Sod 2D, N=1024, Allreduce; histogram of number of ranks]


Basic K-Means Clustering

Discover K equivalence classes of thread behavior
  A thread's behavior is defined as the vector of all its event values over a single metric
Differences in behavior are measured by computing the Euclidean distance between the vectors in E-dimensional space, where E is the number of events

[Figure: Euclidean distance over 2 dimensions; axes are exclusive time of MPI_Allreduce and exclusive time of foo()]
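
For reference, the distance measure can be written out explicitly. The symbols $u$ and $v$ for the event vectors of two threads are notation assumed here, not taken from the slides; $E$ is the number of events as defined above.

```latex
d(u, v) = \sqrt{\sum_{e=1}^{E} \left(u_e - v_e\right)^2}
```

In the two-event illustration, $E = 2$ and the two components are the exclusive times of MPI_Allreduce and foo().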


K-means Clustering (continued)

Parallel K-Means Clustering algorithm (Root)

Root-1: Choose initial K centroids (event-value vectors)

Root-2: Broadcast initial centroids to each Node

Root-3: While not converged:

3a: Receive vector of changes from each Node

3b: Apply change vector to K centroids

3c: If no change to centroids and centroid membership, converged is set to true

3d: Otherwise, broadcast new centroids to each Node

Root-4: Broadcast convergence notice to each Node


K-means Clustering (continued)

Parallel K-Means Clustering algorithm (Node)

Node-1: While not converged:

1a: Receive latest K centroid vectors from Root

1b: For each thread t’s event vector, determine which centroid it is closest to

1c: If t’s closest centroid changes from k to k-prime, subtract t’s event vector from k’s entry in the change vector and add the same value to k-prime’s entry

1d: Send change vector through the reduction tree to Root

1e: Receive convergence notification from Root

Algorithm produces K mean profiles, one for each cluster
Clustering reduces data and can discover performance trends
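
The Root and Node steps can be simulated in a single process to see how the change vector works. The sketch below is an illustration under assumptions, not the TAUmon implementation: the Root keeps a running vector sum and member count per cluster so that applying a change vector yields the new mean, a loop over nodes stands in for the reduction tree, and all function names are hypothetical.

```python
import math


def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(vec, centroids[k]))


def node_step(vectors, membership, centroids):
    """Node steps 1b-1c: reassign threads and record the resulting changes."""
    K, E = len(centroids), len(centroids[0])
    d_sum = [[0.0] * E for _ in range(K)]  # per-cluster change in vector sum
    d_cnt = [0] * K                        # per-cluster change in member count
    moved = 0
    for t, vec in enumerate(vectors):
        old, new = membership[t], nearest(vec, centroids)
        if new == old:
            continue
        moved += 1
        membership[t] = new
        for k, sign in ((old, -1), (new, +1)):
            if k is None:                  # first pass: no previous cluster
                continue
            d_cnt[k] += sign
            for e in range(E):
                d_sum[k][e] += sign * vec[e]
    return d_sum, d_cnt, moved


def parallel_kmeans(nodes, initial_centroids, max_iters=100):
    """Root loop; nodes[i] is the list of thread event vectors on node i."""
    centroids = [list(c) for c in initial_centroids]
    K, E = len(centroids), len(centroids[0])
    sums = [[0.0] * E for _ in range(K)]
    counts = [0] * K
    memberships = [[None] * len(v) for v in nodes]
    for _ in range(max_iters):
        total_moved = 0
        for vectors, membership in zip(nodes, memberships):
            d_sum, d_cnt, moved = node_step(vectors, membership, centroids)
            total_moved += moved
            for k in range(K):             # Root step 3b: apply change vector
                counts[k] += d_cnt[k]
                for e in range(E):
                    sums[k][e] += d_sum[k][e]
        if total_moved == 0:               # Root step 3c: no membership change
            break
        for k in range(K):
            if counts[k] > 0:              # an empty cluster keeps its centroid
                centroids[k] = [s / counts[k] for s in sums[k]]
    return centroids, memberships


nodes = [[[1.0, 1.0], [3.0, 1.0]], [[9.0, 9.0], [11.0, 9.0]]]
print(parallel_kmeans(nodes, [[0.0, 0.0], [10.0, 10.0]]))
```

Only threads that switch clusters contribute to the change vector, so the data each node sends shrinks as the clustering stabilizes. Repeatedly adding and subtracting floating-point vectors can accumulate rounding error over many iterations, which a production version would need to consider.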


TAUmonMRNet (a.k.a. ToM Revisited)

TAU over MRNet (ToM)
  Previously working with MRNet 2.1 (Cluster 2008 paper)
  1-phase and 3-phase filters
  Explore overlay network with different span out (nodes)
TAUmon re-engineered for MRNet 3.0 (released last week!)
  Re-implement ToM functionality
  Use new MRNet support
  Current implementation uses pre-released MRNet 3.0 version
  Testing with released version


MRNet Network Configuration

Scripts used to set up MRNet network configuration
Given P = number of cores for the application, the user can choose an appropriate N = number of tree nodes and K = fanout for deciding how to allocate sufficient computing resources for both application and MRNet
  Number of network leaves can be computed as (N/K)*(K-1)
Probe processes discover and partition computing resources between the application and MRNet
mrnet_topgen utility will write a topology file given K and N and a list of processor hosts available exclusively for MRNet
TAU frontend reads topology file to create the MRNet tree and then writes a new file to inform the application how it can connect to the leaves of the tree
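
As a quick check on the leaf-count expression, the sketch below evaluates (N/K)*(K-1) and compares it with the exact leaf count of a full K-ary tree, in which every internal node has exactly K children. The function names are hypothetical, and the full-tree comparison is an added assumption for illustration; real MRNet topologies need not be full.

```python
def leaves_estimate(num_nodes, fanout):
    """The slide's expression for the number of network leaves: (N/K)*(K-1)."""
    return (num_nodes / fanout) * (fanout - 1)


def full_tree_leaves(num_nodes, fanout):
    """Exact leaf count of a full K-ary tree with N nodes.

    In such a tree N = K * internal + 1, so leaves = N - (N - 1) / K.
    """
    if (num_nodes - 1) % fanout != 0:
        raise ValueError("no full tree exists with this N and K")
    internal = (num_nodes - 1) // fanout
    return num_nodes - internal


# Three-level tree with fanout 3: 1 + 3 + 9 = 13 nodes, 9 of them leaves.
print(leaves_estimate(13, 3), full_tree_leaves(13, 3))
```

For a full tree the two values differ by exactly 1/K, so the expression works as a sizing estimate when deciding how many cores to set aside for MRNet next to the P application cores.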


Monitoring Operation with MRNet

Application collectively invokes TAU_ONLINE_DUMP() to start monitoring operations using current performance information
TAU data is accessed and sent through MRNet’s communication API via streams and filters
Filters perform appropriate aggregation operations on data
TAU frontend is responsible for collecting the data, storing it, and eventual delivery to a consumer


TAUmonMPI

Use MPI-based transport
  No separate launch mechanisms
  Parallel gather operations implemented as a binomial heap with staged MPI point-to-point calls (Rank 0 serves as root)
Current limitations:
  Application shares parallel resources with monitoring transport
  Monitoring operations may cause performance intrusion
  No user control of transport network configuration
Potential advantages
  Easy to use
  Could be more robust overall
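
To illustrate the staged gather described for TAUmonMPI, the sketch below simulates one common binomial-tree schedule in plain Python without MPI. It is an assumption about how the stages could be arranged, not the TAUmonMPI source: at the stage with distance `step`, every rank whose remainder modulo `2 * step` equals `step` sends what it has collected to rank `rank - step`. The function name `binomial_gather` is hypothetical.

```python
def binomial_gather(data_by_rank):
    """Simulate a binomial-tree gather to rank 0.

    Returns (gathered, stages): the data collected at rank 0 in rank order,
    and for each stage the list of (sender, receiver) pairs.
    """
    num_ranks = len(data_by_rank)
    buffers = [[d] for d in data_by_rank]  # what each rank currently holds
    stages = []
    step = 1
    while step < num_ranks:
        sends = []
        for rank in range(step, num_ranks, 2 * step):
            dest = rank - step
            buffers[dest].extend(buffers[rank])  # stands in for a send/recv pair
            buffers[rank] = []
            sends.append((rank, dest))
        stages.append(sends)
        step *= 2
    return (buffers[0] if buffers else []), stages


gathered, stages = binomial_gather(["p0", "p1", "p2", "p3", "p4"])
print(gathered)
print(stages)
```

With P ranks the gather finishes in about log2(P) stages, and the sends within a stage are independent of each other, so they can proceed concurrently.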


TAUmon Experiments: PFLOTRAN

Predictive modeling of subsurface reactive flows
Machines
  ORNL Jaguar and UTK Kraken, Cray XT5
Processor counts
  16,380 cores and 131K cores, 12K (interactive)
Scaling
Instrumentation (Source, PMPI)
  Full: 1131 events total, lots of small routines
  Partial: 1% exclusive + all MPI, 68 events total (44 MPI, 19 PETSc)
    with and without callpaths
Measurements (PAPI)
  Execution time (TOT CYC)
  Counters: FP OPS, TOT IN, L1 DCA/DCM, L2 TCA/TCM, RES STL


TAUmonMPI Event Unification (Cray XT5)

[Figure: TAU unification and merge time]

TAUmonMPI Scaling (PFLOTRAN, Cray XT5)

New histogram timings (secs): 12288: 0.8643; 24576: 0.6238

TAUmonMRnet Scaling (PFLOTRAN, Cray XT5)


TAUmonMPI Scaling (PFLOTRAN, BG/P)


TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)

4104 cores running with 374 extra cores for MRNet transport
Each line bar shows the mean profile of an iteration


TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)

Frames (iteration) 12, 17, 21
12k PFLOTRAN execution
Shifts in order of events sorted by average value over time


TAUmonMRnet Snapshot (FLASH, Cray XT5)

Sod 2D, 1,536 Cray XT5 cores
Over 200 iterations, 15 maximum levels of refinement
MPI_Alltoall plateaus correspond to AMR refinement


TAUmonMRnet Clustering (FLASH, Cray XT5)

[Figure: cluster profiles; events shown include MPI_Init, MPI_Alltoall, COMPRESS_LIST, MPI_Allreduce, and DRIVER_COMPUTEDT]


Validating Performance Monitoring Operations

Build parallel program that pre-loads parallel profiles
  Use it to quickly validate monitoring operation algorithms
  Monitoring operation performance can be quickly observed, analyzed, and optimized
  No need to pay repeated costs of running applications to a desired point in time
  With real pre-generated profiles
Currently developing TAUmon validation tool


Conclusion

Scalable performance monitoring will be important
  Reduce volume of performance data output
  Take advantage of parallel analysis
  Provide online feedback to application
Require scalable infrastructure and integration
TAUmon developed to support TAU monitoring
  Targets two transport infrastructures: MRNet and MPI
  Demonstrated with scalable applications
  Prototype shows good analysis efficiency
  Add support for application feedback
Release of TAUmon with TAU distribution before SC10


Support Acknowledgements

Department of Energy (DOE) Office of Science

ASC/NNSA

Department of Defense (DoD) HPC Modernization Office (HPCMO)

NSF Software Development for Cyberinfrastructure (SDCI)

Research Centre Juelich

Argonne National Laboratory

Technical University Dresden

ParaTools, Inc.

NVIDIA
