Parallel Performance Mapping, Diagnosis, and Data Mining

Allen D. Malony, Sameer Shende, Li Li, Kevin Huck
{malony,sameer,lili,khuck}@cs.uoregon.edu
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon


Page 1

Allen D. Malony, Sameer Shende, Li Li, Kevin Huck {malony,sameer,lili,khuck}@cs.uoregon.edu

Department of Computer and Information Science

Performance Research Laboratory

University of Oregon

Parallel Performance Mapping, Diagnosis, and Data Mining

Page 2

Parallel Performance Mapping, Diagnosis, and Data Mining (ParCo 2005)

Research Motivation

Tools for performance problem solving Empirical-based performance optimization process Performance technology concerns

characterization

PerformanceTuning

PerformanceDiagnosis

PerformanceExperimentation

PerformanceObservation

hypotheses

properties

• Instrumentation• Measurement• Analysis• Visualization

PerformanceTechnology

• Experimentmanagement

• Performancestorage

PerformanceTechnology

Page 3

Challenges in Performance Problem Solving

How to make the process more effective (productive)?
Process may depend on the scale of the parallel system
What are the important events and performance metrics?
  Tied to application structure and computational model
  Tied to application domain and algorithms
Process and tools can/must be more application-aware
  Tools have poor support for application-specific aspects
What are the significant issues that will affect the technology used to support the process?
Enhance application development and benchmarking
New paradigm in performance process and technology

Page 4

Large Scale Performance Problem Solving

How does our view of this process change when we consider very large-scale parallel systems?
What are the significant issues that will affect the technology used to support the process?
Parallel performance observation is clearly needed
In general, there is the concern for intrusion
  Seen as a tradeoff with performance diagnosis accuracy
Scaling complicates observation and analysis
  Performance data size becomes a concern
  Analysis complexity increases
Nature of application development may change

Page 5

Role of Intelligence, Automation, and Knowledge

Scale forces the process to become more intelligent
Even with intelligent and application-specific tools, the decisions of what to analyze are difficult and intractable
More automation and knowledge-based decision making
Build automatic/autonomic capabilities into the tools
  Support broader experimentation methods and refinement
  Access and correlate data from several sources
  Automate performance data analysis / mining / learning
  Include predictive features and experiment refinement
Knowledge-driven adaptation and optimization guidance
  Will allow scalability issues to be addressed in context

Page 6

Outline of Talk

Performance problem solving
  Scalability, productivity, and performance technology
  Application-specific and autonomic performance tools
TAU parallel performance system (Bernd said "No!")
Parallel performance mapping
Performance data management and data mining
  Performance Data Management Framework (PerfDMF)
  PerfExplorer
Model-based parallel performance diagnosis
  Poirot and Hercule
Conclusions

Page 7

TAU Performance System

[Figure: TAU performance system architecture diagram; visible label: "event selection"]

Page 8

Semantics-Based Performance Mapping

Associate performance measurements with high-level semantic abstractions

Need mapping support in the performance measurement system to assign data correctly

Page 9

Hypothetical Mapping Example

Particles distributed on the surfaces of a cube:

Particle* P[MAX]; /* array of particles */

int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last += particles_on_this_face;
  }
}

Page 10

Hypothetical Mapping Example (continued)

How much time (flops) is spent processing face i particles?
What is the distribution of performance among faces?
How is this determined if execution is parallel?

int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}

int main() {
  GenerateParticles();  /* create a list of particles */
  /* iterate over the list */
  for (int i = 0; i < N; i++)
    ProcessParticle(P[i]);
}
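Answering the per-face questions requires attributing each particle's processing time back to the face it came from, not just to ProcessParticle. A minimal sketch of that attribution idea, assuming the face index was recorded at generation time (the names `face_time`, `record`, and `timed_process` are ours for illustration, not the TAU API):

```cpp
#include <array>
#include <chrono>

constexpr int kFaces = 6;
std::array<double, kFaces> face_time{};   // accumulated seconds per face

// Credit a measured duration to the face that produced the particle.
void record(int face, double seconds) {
    face_time[face] += seconds;
}

// Time one particle's processing and map the cost to its face.
template <typename Fn>
void timed_process(int face, Fn&& process_particle) {
    auto t0 = std::chrono::steady_clock::now();
    process_particle();
    auto t1 = std::chrono::steady_clock::now();
    record(face, std::chrono::duration<double>(t1 - t0).count());
}
```

With this in place, the per-face distribution is just the contents of `face_time`; in a parallel run each process would keep its own array and the arrays would be merged at analysis time.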


Page 11

No Performance Mapping versus Mapping

Typical performance tools report performance with respect to routines
  Do not provide support for mapping
TAU's performance mapping can observe performance with respect to the scientist's programming and problem abstractions

[Figure: profile comparison, TAU without mapping vs. TAU with mapping]

Page 12

Performance Mapping Approaches

ParaMap (Miller and Irvin)
  Maps low-level performance to high-level source constructs
  Noun-Verb (NV) model to describe the mapping
    a noun is a program entity
    a verb represents an action performed on a noun
    sentences (nouns and verbs) map to other sentences
  Mappings: static, dynamic, set of active sentences (SAS)
Semantic Entities / Abstractions / Associations (SEAA)
  Entities defined at any level of abstraction (user-level)
  Attribute entity with semantic information
  Entity-to-entity associations
  Target measurement layer and asynchronous operation

Page 13

SEAA Implementation

Two association types (implemented in the TAU API)
  Embedded: extends the associated object to store the performance measurement entity
  External: creates an external look-up table, using the address of the object as the key to locate the performance measurement entity
Applied to performance measurement problems
  callpath/phase profiling, C++ templates, ...
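The two association types can be sketched in plain C++. This is a toy illustration under our own names (`Timer`, `TaskEmbedded`, `timer_table`, `lookup`), not TAU's implementation:

```cpp
#include <unordered_map>

struct Timer { double inclusive = 0.0; };   // stand-in measurement entity

// Embedded association: the application object is extended to carry
// its performance measurement entity directly.
struct TaskEmbedded {
    const char* name;
    Timer timer;            // measurement entity stored in the object
};

// External association: the object is left untouched; a look-up table
// keyed by the object's address locates the measurement entity.
struct Task { const char* name; };
std::unordered_map<const void*, Timer> timer_table;

Timer& lookup(const Task& t) {
    return timer_table[static_cast<const void*>(&t)];
}
```

The embedded form is faster (no hash look-up) but requires changing the object's layout; the external form works when the object's type cannot be modified, which is why both are useful.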

Page 14

Uintah Problem Solving Environment (PSE)

Uintah component architecture for the Utah C-SAFE project
Application programmers provide:
  description of the computation (tasks and variables)
  code to perform a task on a single "patch" (sub-region of space)
  components for scheduling, partitioning, load balancing, ...
Uintah Computational Framework (UCF)
  Execution model based on software (macro) dataflow
    computations expressed as directed acyclic graphs of tasks
    inputs/outputs specified for each patch in a structured grid
  Abstraction of global single-assignment memory
  Task graph gets mapped to processing resources
  Communication schedule approximates the global optimum

Page 15

Uintah Task Graph (Material Point Method)

Diagram of named tasks (ovals) and data (edges)
  Imminent computation
  Dataflow-constrained
MPM: Newtonian material point motion time step
  Solid: values defined at a material point (particle)
  Dashed: values defined at a vertex (grid)
  Prime ('): values updated during the time step

Page 16

Task Execution in Uintah Parallel Scheduler

Profile methods and functions in scheduler and in MPI library

Need to map performance data!

Task execution time dominates (what task?)

MPI communication overheads (where?)

Task execution time distribution

Page 17

Mapping Instrumentation in UCF (example)

Use the TAU performance mapping API:

void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw ) {
  ...
  TAU_MAPPING_CREATE(task->getName(), "[MPIScheduler::execute()]",
                     (TauGroup_t)(void*)task->getName(),
                     task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}

Page 18

Task Performance Mapping (Profile)

Performance mapping for different tasks

Mapped task performance across processes

Page 19

Work Packet – to – Task Mapping (Trace)

Work packet computation events colored by task type

Distinct phases of computation can be identified based on task

Page 20

Comparing Uintah Traces for Scalability Analysis

[Figure: side-by-side Uintah traces at 8 processes and 32 processes]

Page 21

Important Questions for Application Developers

How does performance vary with different compilers?
Is poor performance correlated with certain OS features?
Has a recent change caused unanticipated performance?
How does performance vary with MPI variants?
Why is one application version faster than another?
What is the reason for the observed scaling behavior?
Did two runs exhibit similar performance?
How are performance data related to application events?
Which machines will run my code the fastest, and why?
Which benchmarks predict my code performance best?

Page 22

Performance Problem Solving Goals

Answer questions at multiple levels of interest
Data from low-level measurements and simulations
  used to predict application performance
High-level performance data spanning dimensions
  machine, applications, code revisions, data sets
  examine broad performance trends
Discover general correlations between application performance and features of the external environment
Develop methods to predict application performance from lower-level metrics
Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
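The benchmark-to-application correlation goal can be made concrete with a standard statistic. Here is a small sketch using Pearson's r (our choice of statistic for illustration; the talk does not fix one), correlating a benchmark metric with application run times across machines:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two equal-length series,
// e.g. a benchmark score and an application's run time per machine.
double pearson(const std::vector<double>& a, const std::vector<double>& b) {
    const size_t n = a.size();
    double ma = 0.0, mb = 0.0;
    for (size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;
    double cov = 0.0, va = 0.0, vb = 0.0;
    for (size_t i = 0; i < n; ++i) {
        cov += (a[i] - ma) * (b[i] - mb);
        va  += (a[i] - ma) * (a[i] - ma);
        vb  += (b[i] - mb) * (b[i] - mb);
    }
    return cov / std::sqrt(va * vb);   // in [-1, 1]
}
```

A benchmark whose scores give |r| near 1 against an application's run times across machines is a good predictor for that application; values near 0 mean it tells you little.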

Page 23

Empirical-Based Performance Optimization

[Figure: empirical performance optimization cycle (Performance Tuning, Performance Diagnosis, Performance Experimentation, Performance Observation, linked by hypotheses, properties, and characterization), extended with experiment management: process/experiment schemas, experiment trials, and observability requirements]

Page 24

Performance Data Management Framework

ICPP 2005 paper

Page 25

PerfExplorer (K. Huck, Ph.D. student, UO)

Performance knowledge discovery framework
  Uses the existing TAU infrastructure
    TAU instrumentation data, PerfDMF
  Client-server based system architecture
  Data mining analysis applied to parallel performance data
    comparative, clustering, correlation, dimension reduction, ...
Technology integration
  Relational Database Management Systems (RDBMS)
  Java API and toolkit
  R-project / Omegahat statistical analysis
  WEKA data mining package
  Web-based client
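To make the clustering analysis concrete: the sketch below is a minimal 1-D k-means, the kind of algorithm applied to per-process profile data (one value per process for some event) to find groups of processes with similar behavior. PerfExplorer itself uses the WEKA package; this standalone version only illustrates the idea, and the function name and fixed iteration count are our choices:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// 1-D k-means: x holds one metric value per process; centers holds the
// initial cluster centers. Returns a cluster label per process.
std::vector<int> kmeans1d(const std::vector<double>& x,
                          std::vector<double> centers, int iters = 20) {
    std::vector<int> label(x.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: each point joins its nearest center.
        for (size_t i = 0; i < x.size(); ++i) {
            int best = 0;
            for (size_t c = 1; c < centers.size(); ++c)
                if (std::fabs(x[i] - centers[c]) < std::fabs(x[i] - centers[best]))
                    best = static_cast<int>(c);
            label[i] = best;
        }
        // Update step: each center moves to the mean of its points.
        for (size_t c = 0; c < centers.size(); ++c) {
            double sum = 0.0; int n = 0;
            for (size_t i = 0; i < x.size(); ++i)
                if (label[i] == static_cast<int>(c)) { sum += x[i]; ++n; }
            if (n) centers[c] = sum / n;
        }
    }
    return label;
}
```

On real profiles each process contributes a vector (one value per event), so the distance is Euclidean over many dimensions, but the assignment/update structure is identical.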

Page 26

PerfExplorer Architecture

SC’05 paper

Page 27

PerfExplorer Client GUI

Page 28

Hierarchical and K-means Clustering (sPPM)

Page 29

Miranda Clustering on 16K Processors

Page 30

Parallel Performance Diagnosis

Performance tuning process
  Process to find and report performance problems
  Performance diagnosis: detect and explain problems
  Performance optimization: performance problem repair
Experts approach it systematically and use experience
  Expertise is hard to formulate and automate
  Performance optimization is fundamentally hard
Focus on the performance diagnosis problem
  Characterize diagnosis processes
  How diagnosis integrates with performance experimentation
  Understand the knowledge engineering

Page 31

Parallel Performance Diagnosis Architecture

Page 32

Performance Diagnosis System Architecture

Page 33

Problems in Existing Diagnosis Approaches

Low-level abstraction of properties/metrics
  Independent of program semantics
  Relate to component structure, not algorithmic structure or parallelism model
Insufficient explanation power
  Hard to interpret in the context of program semantics
  Performance behavior not tied to operational parallelism
Low applicability and adaptability
  Difficult to apply in different contexts
  Hard to adapt to new requirements

Page 34

Poirot Project

Lack of a formal theory of diagnosis processes
  Compare and analyze performance diagnosis systems
  Use theory to create a system that is automated / adaptable
Poirot performance diagnosis (theory, architecture)
  Survey of diagnosis methods / strategies in tools
  Heuristic classification approach (match to characteristics)
  Heuristic search approach (based on problem knowledge)
Problems
  Descriptive results do not explain with respect to context
    users must reason about high-level causes
  Performance experimentation not guided by diagnosis
  Lacks automation

Page 35

Model-Based Approach

Knowledge-based performance diagnosis
  Capture knowledge about performance problems
  Capture knowledge about how to detect and explain them
Where does the knowledge come from?
  Extract from parallel computational models
    structural and operational characteristics
  Associate computational models with performance
Do parallel computational models help in diagnosis?
  Enable better understanding of problems
  Enable more specific experimentation
  Enable more effective hypothesis testing and search

Page 36

Implications for Performance Diagnosis

Models benefit performance diagnosis
  Base instrumentation on program semantics
  Capture performance-critical features
  Enable explanations close to the user's understanding of the computation's operation and performance behavior
Reuse performance analysis expertise on the commonly-used models
Model examples
  Master-worker
  Pipeline
  Divide-and-conquer
  Domain decomposition
  Phase-based
  Compositional

Page 37

Hercule Project

Goals: automation, adaptability, validation

Page 38

Approach

Make use of model knowledge to diagnose performance
  Start with commonly-used computational models
  Engineer model knowledge
  Integrate model knowledge with the performance measurement system
  Build a cause inference system
    define "causes" at the parallelism level
    build a causality relation between the low-level "effects" and the "causes"
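The cause inference idea can be sketched as threshold rules over observed metrics. Hercule itself encodes such rules in CLIPS; this toy hard-codes two rules in the spirit of the master-worker inference tree, with invented field names and threshold values:

```cpp
#include <string>

// Hypothetical observation summary for one master-worker run.
struct MWObservation {
    double master_assign_time_frac;  // fraction of time master spends assigning tasks
    double queue_wait_frac;          // fraction of time workers wait in the master queue
};

// Map low-level "effects" (the observed fractions) to parallelism-level
// "causes" via thresholds k1, k2 (placeholders for tuned values).
std::string diagnose(const MWObservation& o,
                     double k1 = 0.2, double k2 = 0.2) {
    if (o.master_assign_time_frac > k1)
        return "fine granularity: master task-assignment time significant";
    if (o.queue_wait_frac > k2)
        return "master bottleneck: workers wait long in master queue";
    return "no master-worker problem detected";
}
```

A real inference system chains many such rules, orders the candidate causes by priority, and requests further experiments when an observation is missing; the point here is only the effect-to-cause direction of the rules.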

Page 39

Master-Worker Parallel Computation Model

Page 40

Performance Diagnosis Inference Tree (MW)

[Figure: inference tree for the master-worker model. Observations (e.g., number of requests in the master queue > K1 in some time intervals; workers waiting a long time in the master queue; waiting a long time for the master to assign each individual task; large amount of messages exchanged each time; initialization/finalization time significant; master task-assignment time significant; time imbalance; worker number saturation; worker starvation) combine to support the hypothesis of low speedup and its prioritized causes: 1. insufficient parallelism, 2. fine granularity, 3. master being a bottleneck, 4. some workers noticeably inefficient. Ki denotes a threshold; "+" denotes coexistence.]

Page 41

Knowledge Engineering - Abstract Event (MW)

Use CLIPS expert system building tool

Page 42

Diagnosis Results Output (MW)


Page 43

Experimental Diagnosis Results (MW)

Page 44

Concluding Discussion

Performance tools must be used effectively
More intelligent performance systems for productive use
  Evolve to application-specific performance technology
  Deal with scale by "full range" performance exploration
  Autonomic and integrated tools
  Knowledge-based and knowledge-driven process
Performance observation methods do not necessarily need to change in a fundamental sense
  More automatically controlled and more efficiently used
Support model-driven performance diagnosis
Develop next-generation tools and deliver them to the community

Page 45

Support Acknowledgements

Department of Energy (DOE) Office of Science contracts
University of Utah ASCI Level 1 sub-contract
ASC/NNSA Level 3 contract
NSF High-End Computing grant
Research Centre Juelich
  John von Neumann Institute
  Dr. Bernd Mohr
Los Alamos National Laboratory
