
Page 1: Performance Profiling Overhead Compensation for MPI Programs

Sameer Shende, Allen D. Malony, Alan Morris, Felix Wolf

{malony,sameer,amorris}@cs.uoregon.edu, [email protected]

Performance Research Laboratory, Department of Computer and Information Science, University of Oregon

Innovative Computing Laboratory, University of Tennessee

Page 2: Outline

Problem description

Overhead modeling and compensation analysis

Motivating example: master-worker test case

Profiling and on-the-fly compensation models

Schemes to piggyback delay in message passing: MPI implementation

Demonstration of overhead compensation: Monte Carlo master-worker application profiling

Conclusions

Page 3: Empirical Parallel Performance Analysis

Measurement-based analysis process
  Performance observation (measurement)
  Performance diagnosis (finding and explaining problems)

Profiling and tracing are the two main measurement methods

Profiling computes summary statistics
  Trades extra online computation for a smaller runtime data size and less information

Tracing captures time-sequenced records of events
  More analysis opportunities, including profile analysis
  Off-line analysis may be more complex
  Produces a larger volume of performance information
  Tracing is considered to be of "higher cost"

Page 4: Overhead, Intrusion, and Perturbation

All performance measurements generate overhead
  Overhead is the cost of performance measurement: execution time, instruction count, memory references, ...

Overhead causes (generates) performance intrusion
  Intrusion is the dynamic performance effect of overhead: execution time slowdown, increased instruction count, ...

Intrusion (potentially) causes performance perturbation
  Perturbation is a change in performance behavior: an alteration of the "probable" performance behavior
  Measurement does not change the set of "possible" executions
  Perturbation can lead to erroneous performance results

Page 5: Performance Analysis Conundrum

What is the "true" parallel computation performance?
  Some technique must be used to observe performance
  Performance measurement causes intrusion
  Any performance intrusion might result in perturbation
  Performance analysis is based on performance measures

How is the "accuracy" of performance analysis evaluated?
  How is this done when the "true" performance is unknown?

Uncertainty applies to all experimental methods
  "Truth" lies just beyond the reach of observation
  Accuracy will be a relative assessment

Page 6: Profiling Types

Flat profiles
  Performance distributed onto the static program structure

Path profiles
  Performance associated with program execution paths

Events
  Entry/exit (begin/end): change in metrics between events
  Atomic: current value of a metric at event occurrence

Profile analysis
  Inclusive statistics: performance of descendant events is included
  Exclusive statistics: performance of descendant events is not included (a sketch of both calculations follows)
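To make the inclusive/exclusive distinction concrete, here is a minimal sketch (invented names, not TAU's implementation) of how entry/exit events can drive both statistics with a shadow call stack:

#include <assert.h>

#define MAX_DEPTH 64
#define MAX_FUNCS 16

static double incl[MAX_FUNCS], excl[MAX_FUNCS]; /* per-function totals      */
static double start_ts[MAX_DEPTH];  /* entry timestamp per stack frame      */
static double child_t[MAX_DEPTH];   /* time spent in descendants            */
static int    func_id[MAX_DEPTH];   /* which function each frame executes   */
static int    top = -1;

void profile_entry(int func, double now) {
    ++top;
    func_id[top]  = func;
    start_ts[top] = now;
    child_t[top]  = 0.0;
}

void profile_exit(int func, double now) {
    assert(func_id[top] == func);      /* entries and exits must nest       */
    double t = now - start_ts[top];    /* inclusive time of this invocation */
    incl[func] += t;                   /* descendants included              */
    excl[func] += t - child_t[top];    /* descendants excluded              */
    --top;
    if (top >= 0) child_t[top] += t;   /* credit the parent's child time    */
}

Each exit charges the invocation's full time to the function's inclusive total, subtracts the time spent in callees for the exclusive total, and credits the parent frame with the child's time.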

Page 7: Profiling Strategy and Compensation Problem

Advocate measured profiling as a method of choice
  Profiling intrusion is reported as % slowdown in execution
  Implicit assumption that overhead is equally distributed
  Without compensation, profiling results may be distorted
  Parallel profiling results may be skewed

Is it possible to account for overhead effects?
Is it possible to compensate for overhead?
How are profiling analyses affected?
What improvements in profiling accuracy result?

Page 8: Overhead Compensation Methods

Trace-based (Malony, Ph.D. thesis; PPoPP '91/'92)
  Overhead compensation in event-based execution replay: post-mortem and off-line
  Analysis and repair of performance perturbations
    apply Lamport's "happened before" relation
    correct "errors" while maintaining partial-order dependencies
  Both profile and trace performance analysis are possible

Profile-based
  Needs online compensation models
  On-the-fly measurement and profile analysis algorithm
  Explicit process interaction required

Page 9: Models for Overhead Compensation in Profiling

Overhead compensation in profiling is a harder problem

"Serial" compensation models (Euro-Par 2004)
  Compensate only the local process overhead
  Do not take parallel dependencies into account

Parallel compensation models (Euro-Par 2005)
  Interdependency of "overhead" effects
  Must track and exchange "delay" information
  Attempt to correct waiting time
  Cannot correct execution-order perturbations, but such perturbations can be identified

Model implementation in MPI (EuroPVM-MPI 2005)
  On-the-fly algorithm implemented in TAU

Page 10: Motivating Example: Parallel Master-Worker

[Figure: timelines for master M and workers W1-W3, with waiting and overhead regions marked. Top panel: "Measured execution"; bottom panel: "Approximated execution (rational reconstruction)".]

Workers must communicate overhead to the master in order for the master to know when messages would have been received.


M encounters very little overhead, only at the beginning and end of the execution.


Page 11: Profiling and On-the-Fly Compensation Models

Study parallel profile measurement cases
  Rational explanation of the effects of measurement overhead
  Local (independent) events and measurement intrusion
  Interprocess (dependent) events and the impact on interactions

Reconstruct execution behavior without measurement
  Use knowledge of events and overhead
  Remove overhead and recompute event timings
  Maintain interprocess dependencies

Learn compensation algorithms from the reconstructed cases
  Compare measured vs. approximated executions
  Study enough cases until a general solution appears

Page 12: One-Message Scenario (Case 1)

[Figure: timelines for processes P1 (sender) and P2 (receiver), shown as "Measured execution" and "Approximated execution (rational reconstruction)". Legend: S - send, Rb - receive begin, Re - receive end, w - wait; o1, o2 - accumulated overheads on P1, P2.]


The sending process must tell the other process how much earlier the message would have been sent.

Case condition: o1 >= o2 + w

  w'  = 0
  o2' = o2 + w
  x1  = o1
  x2  = min(o1, o2 + w) = o2 + w

Overhead must absorb erroneous waiting!

x1 and x2 represent “delay” of future events

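An illustrative reading of Case 1 (numbers invented): with o1 = 5, o2 = 1, and w = 3, we have o1 >= o2 + w, so the receiver's measured wait is entirely an artifact of the sender's larger overhead. Compensation sets w' = 0, converts the wait into overhead (o2' = 1 + 3 = 4), and the delays carried forward are x1 = 5 and x2 = min(5, 4) = 4.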

Page 13: One-Message Scenario (Case 2)

[Figure: the same P1/P2 timelines as Case 1, shown as "Measured execution" and "Approximated execution". Legend: S - send, Rb - receive begin, Re - receive end, w - wait.]


Case condition: o1 < o2 + w

  w'  = w + (o2 - o1)   [equivalently w - (o1 - o2) when o1 > o2]
  o2' = o2
  x1  = o1
  x2  = min(o1, o2 + w) = o1

Waiting time may increase!

x1 and x2 represent “delay” of future events

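An illustrative reading of Case 2 (numbers invented): with o1 = 2, o2 = 3, and w = 4, we have o1 < o2 + w, so the message would still arrive after the receiver is ready. Compensation gives w' = 4 + (3 - 2) = 5 (the waiting time increases because o1 < o2), with delays x1 = 2 and x2 = min(2, 7) = 2.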

Page 14: General Algorithm

Based on a generalization of the two-process model
  Update local overhead and delay based on measurement
  Update local overhead and delay based on messages: receives use only the delay value reported by the sender

Each process transmits its local delay with every send message

Important to note that only the overhead value is used in profile calculations
  Profiles subtract only the overhead, in both inclusive and exclusive performance statistics

Implement the general algorithm in parallel profiling systems (see the sketch below)
  Implemented in the TAU performance system
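As a rough illustration of this bookkeeping (a sketch of the two-case model from the previous slides, with invented names, not TAU's code):

typedef struct {
    double overhead;   /* accumulated local measurement overhead (o) */
    double delay;      /* accumulated delay (x) of future events     */
} comp_state;

/* Charge the cost of one instrumentation event locally. */
void comp_measure(comp_state *s, double cost) {
    s->overhead += cost;
    s->delay    += cost;
}

/* The sender piggybacks its current delay on every message. */
double comp_on_send(const comp_state *s) {
    return s->delay;
}

/* Receiver update: w is the measured waiting time, x1 the sender's
 * piggybacked delay.  Returns the compensated waiting time. */
double comp_on_receive(comp_state *s, double w, double x1) {
    double x2     = (x1 < s->delay + w) ? x1 : s->delay + w; /* min(o1, o2+w) */
    double w_comp = w - (x1 - s->delay);                     /* w + (o2 - o1) */
    if (w_comp < 0.0) {
        s->overhead += w;          /* Case 1: overhead absorbs the erroneous */
        w_comp = 0.0;              /* waiting (o2' = o2 + w)                 */
    }
    s->delay = x2;
    return w_comp;
}

As the slide stresses, only the overhead value enters the profile subtraction; the delay is exchanged solely to decide how much of the measured waiting was genuine.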

Page 15: How Is the Local Delay Communicated?

The sender must send its local delay in each message

The problem is how to do this
  Need to avoid adding extra intrusion
  Need to avoid further perturbing performance

Goal
  Provide a widely portable prototype
  Efficiently implemented and easily applied

This capability is not currently available in the MPI library
  Build a prototype for MPI in TAU

Page 16: MPI Implementation – Scheme 1

Ideally the delay would be included in each send message
Look for methods to "piggyback" the delay on messages

Modify the source code of the underlying MPI implementation
  Extend the message header in the communication substrate
  Approach taken by Photon
  Not portable to all MPI implementations
  Relies on a specially instrumented communication library

Page 17: MPI Implementation – Scheme 2

Send an additional message containing the delay information
  Done using the portable MPI wrapper interposition library (PMPI)
  Portable to all MPI implementations (see the wrapper sketch below)

Performance penalty from the extra message transmission
  Penalty not incurred in the first scheme
  Penalty is both overhead and perturbation
  Would require further compensation

Delay information should be tied to the original message
  Hard to guarantee a tight coupling
  Could lead to other problems if not
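A minimal sketch of this scheme, assuming a TAU-like PMPI wrapper; TAU_DELAY_TAG, tau_local_delay(), and tau_note_sender_delay() are invented for illustration:

#include <mpi.h>

#define TAU_DELAY_TAG 32767                 /* hypothetical reserved tag */

extern double tau_local_delay(void);        /* hypothetical accessors */
extern void   tau_note_sender_delay(double);

/* The wrapper intercepts MPI_Send; PMPI_Send reaches the real library. */
int MPI_Send(const void *buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm) {
    double delay = tau_local_delay();
    PMPI_Send(&delay, 1, MPI_DOUBLE, dest, TAU_DELAY_TAG, comm);
    return PMPI_Send(buf, count, dt, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype dt, int src,
             int tag, MPI_Comm comm, MPI_Status *status) {
    double delay;
    PMPI_Recv(&delay, 1, MPI_DOUBLE, src, TAU_DELAY_TAG, comm,
              MPI_STATUS_IGNORE);
    tau_note_sender_delay(delay);
    return PMPI_Recv(buf, count, dt, src, tag, comm, status);
}

With MPI_ANY_SOURCE or out-of-order arrival the delay message and the payload can decouple, which is exactly the tight-coupling problem noted above.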

Page 18: MPI Implementation – Scheme 3

Copy the contents of the original message into a new message
  Create a new message header that includes the delay information
  Send the new message and receive the new message
  Receiver must copy the original message to its destination

Portability advantage of the second scheme
  Could be implemented using PMPI
  Also avoids transmission of an additional message

Copying message contents is expensive (see the sketch below)
  Must be regarded as overhead and compensated
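A hedged sketch of the copy-based packing, assuming a contiguous datatype (illustrative, not TAU's code):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Scheme 3: prepend the delay to a copy of the payload and send the
 * combined buffer as raw bytes. */
int tau_send_copy(const void *buf, int count, MPI_Datatype dt,
                  int dest, int tag, MPI_Comm comm, double delay) {
    int size, rc;
    MPI_Type_size(dt, &size);
    size_t bytes = (size_t)size * (size_t)count;
    char *tmp = malloc(sizeof(double) + bytes);

    memcpy(tmp, &delay, sizeof(double));        /* the new "header"   */
    memcpy(tmp + sizeof(double), buf, bytes);   /* the expensive copy */
    rc = PMPI_Send(tmp, (int)(sizeof(double) + bytes),
                   MPI_BYTE, dest, tag, comm);
    free(tmp);
    return rc;
}

Both memcpy calls, and the receiver's mirror-image copy, are themselves measurement overhead and would have to be compensated in turn.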

Page 19: MPI Implementation – Scheme 4

The problem with the third scheme is creating a new message
  Needed because the delay information was placed in a header

Suppose we put the delay information in the message instead
  The problem is that we cannot modify the original message

Idea: create a new "structured" datatype containing two members
  a pointer to the original message buffer (the n elements of the datatype passed to the original MPI call)
  a double-precision number containing the local delay

Committed as a new user-defined datatype; MPI is instructed to send or receive one element of this datatype (see the sketch below)
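A minimal sketch of the datatype construction, using absolute addresses with MPI_BOTTOM so one datatype can span the user's buffer and the delay value (the function name is invented):

#include <mpi.h>

int tau_send_piggyback(const void *buf, int count, MPI_Datatype dt,
                       int dest, int tag, MPI_Comm comm, double *delay) {
    int          blocklens[2] = { count, 1 };
    MPI_Aint     displs[2];
    MPI_Datatype types[2]     = { dt, MPI_DOUBLE };
    MPI_Datatype packed;
    int rc;

    /* Absolute addresses let the datatype cover two unrelated buffers. */
    MPI_Get_address(buf, &displs[0]);
    MPI_Get_address(delay, &displs[1]);

    MPI_Type_create_struct(2, blocklens, displs, types, &packed);
    MPI_Type_commit(&packed);
    rc = PMPI_Send(MPI_BOTTOM, 1, packed, dest, tag, comm); /* one message, no copy */
    MPI_Type_free(&packed);
    return rc;
}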

Page 20: MPI Implementation – Scheme 4 (continued)

Only one message is sent
  Avoids expensive copying of data buffers

MPI decides internally how to transmit the message
  One option is to use vector read and write calls instead of their scalar counterparts

The solution is portable to all MPI implementations
  Structured datatypes are defined in the MPI standard
  Can be implemented using PMPI
  Leaves efficient transmission to the MPI implementation

Must wrap each MPI API call
  Want to handle both synchronous and asynchronous calls

Page 21: Mapping MPI Calls – Send and Receive

Synchronous (MPI_Send and MPI_Recv)
  An automatic variable holding the delay value is allocated on the stack

Asynchronous (MPI_Isend and MPI_Irecv)
  Sender implementation
    a global variable in heap memory is used to store the delay value
    the location of this variable is used in the new datatype structure
  Receiver implementation (see the sketch below)
    the received piggybacked delay is copied to a heap location
    need a map linking the delay value to the MPI call
    cannot put the calculations in MPI_Isend and MPI_Irecv: the message only becomes visible in the waiting or testing calls (MPI_Wait, MPI_Test, or variants)
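A hedged sketch of the receiver-side map (a toy linear table, not TAU's data structure; the tau_* helpers are invented, with tau_piggyback_type building the Scheme 4 struct datatype over buf and slot):

#include <mpi.h>
#include <stdlib.h>

#define MAX_PENDING 1024

static MPI_Request pending_req[MAX_PENDING];   /* pending request ...    */
static double     *pending_delay[MAX_PENDING]; /* ... and its delay slot */
static int         npending = 0;

extern MPI_Datatype tau_piggyback_type(void *buf, int count,
                                       MPI_Datatype dt, double *slot);
extern void tau_note_sender_delay(double);

int MPI_Irecv(void *buf, int count, MPI_Datatype dt, int src,
              int tag, MPI_Comm comm, MPI_Request *req) {
    double *slot = malloc(sizeof(double));   /* heap home for the delay */
    MPI_Datatype packed = tau_piggyback_type(buf, count, dt, slot);
    int rc = PMPI_Irecv(MPI_BOTTOM, 1, packed, src, tag, comm, req);
    MPI_Type_free(&packed);                  /* MPI defers actual deletion */
    pending_req[npending]     = *req;        /* link delay slot to the call */
    pending_delay[npending++] = slot;
    return rc;
}

int MPI_Wait(MPI_Request *req, MPI_Status *status) {
    MPI_Request r = *req;                    /* capture: PMPI_Wait nulls it */
    int rc = PMPI_Wait(req, status);
    for (int i = 0; i < npending; i++) {
        if (pending_req[i] == r) {
            tau_note_sender_delay(*pending_delay[i]); /* now visible */
            free(pending_delay[i]);
            pending_req[i]   = pending_req[--npending];
            pending_delay[i] = pending_delay[npending];
            break;
        }
    }
    return rc;
}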

Page 22: Mapping MPI Calls – Collective Operations

Asynchronous MPI calls can be perturbed by overhead
  Different receive order than without measurement
  Must maintain receive order to ensure determinacy

We know more about collective operations; consider MPI_Gather
  Extract all piggybacked delay values into an array
  Compute the minimum delay over the MPI communicator: this effectively identifies the last process to arrive at the gather
  The root's waiting time is adjusted based on this minimum delay

Collective operations reduce to finding the minimum delay (see the sketch below)
  Overhead is compensated accordingly
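A hedged sketch of the root-side computation (illustrative; in the real wrapper the delays arrive piggybacked on the gathered payload rather than in a separate gather):

#include <mpi.h>
#include <stdlib.h>

/* Gather every process's delay; the root takes the minimum to decide
 * how much of its measured waiting at the gather was genuine. */
double tau_gather_min_delay(double local_delay, int root, MPI_Comm comm) {
    int rank, size;
    double min_delay = local_delay;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *delays = (rank == root) ? malloc(size * sizeof(double)) : NULL;
    PMPI_Gather(&local_delay, 1, MPI_DOUBLE,
                delays, 1, MPI_DOUBLE, root, comm);
    if (rank == root) {
        for (int i = 0; i < size; i++)
            if (delays[i] < min_delay) min_delay = delays[i];
        /* the minimum identifies the last process to (virtually) arrive;
         * the root's waiting time is adjusted by it */
        free(delays);
    }
    return min_delay;   /* meaningful at the root only */
}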

Page 23: Mapping MPI Calls – More Collective Operations

MPI_Bcast reduces to a synchronous send/receive for each process in the MPI communicator

MPI_Scatter behaves like MPI_Gather
  Same as receiving a message from the root in all tasks

MPI_Barrier is implemented as a combination (see the sketch below)
  MPI_Gather of local delays to the root
  root finds the minimum delay and adjusts its waiting time
  root determines its local delay
  MPI_Bcast communicates the root's local delay to all processes in the communicator

Efficiencies of the underlying MPI substrate are preserved
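A hedged sketch of the barrier combination (illustrative names; tau_gather_min_delay is the helper from the previous sketch):

#include <mpi.h>

extern double tau_gather_min_delay(double, int, MPI_Comm);
extern double tau_local_delay(void);
extern void   tau_adjust_waiting(double min_delay);
extern void   tau_note_sender_delay(double);

/* Barrier = gather local delays to the root + broadcast the root's delay. */
int MPI_Barrier(MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    double min_d = tau_gather_min_delay(tau_local_delay(), 0, comm);
    int rc = PMPI_Barrier(comm);
    if (rank == 0) tau_adjust_waiting(min_d);       /* root fixes its waiting */
    double root_delay = (rank == 0) ? tau_local_delay() : 0.0;
    PMPI_Bcast(&root_delay, 1, MPI_DOUBLE, 0, comm);
    if (rank != 0) tau_note_sender_delay(root_delay); /* adopt root's delay */
    return rc;
}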

Page 24: Compensated Parallel Master-Worker Scenario

[Figure: master/worker timelines (M, W1-W3) with waiting regions marked. Top panel: "Measured execution"; bottom panel: "Compensated execution".]


Must maintain receive event ordering in overhead compensation!


Page 25: Master-Worker Overhead Compensation in TAU

MPI program to compute π with Monte Carlo integration (a sketch of the core computation follows)
  Master generates work (a pair of random coordinates)
  Workers determine whether the coordinates fall above or below the π curve
  Iteratively estimate π to within a given range
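A hedged sketch of that core computation (the standard unit-square Monte Carlo estimate, not the authors' code):

#include <stdlib.h>

/* Estimate pi by sampling points in the unit square and counting how
 * many fall under the quarter-circle x*x + y*y <= 1. */
double estimate_pi(long samples) {
    long inside = 0;
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;   /* the master's "work":     */
        double y = (double)rand() / RAND_MAX;   /* a random coordinate pair */
        if (x * x + y * y <= 1.0)               /* the worker's test        */
            inside++;
    }
    return 4.0 * (double)inside / (double)samples;
}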

Four execution modes
  No instrumentation
  Full instrumentation, no compensation
  Full instrumentation, local-only compensation
  Full instrumentation, parallel compensation

Execution times per mode:

Execution mode                                 Master     Worker
No instrumentation                             73.926     73.834
Full instrumentation, no compensation          128.179    128.173
Full instrumentation, local-only compensation  139.56     73.212
Full instrumentation, parallel compensation    74.126     73.909

Page 26: Monte Carlo Integration of π – Profiles

[Figure: four TAU profile views of M and W1-W3: "MPI only instrumentation" (the reference execution); "Full instrumentation, no compensation" (74% error); "Full instrumentation, local compensation" (89% error in the master); "Full instrumentation, parallel compensation" (1.0-1.4% error with all events).]

Compare TAU profiles for the different instrumentations
Profile many application events to generate overhead


Page 27: Conclusion

Developed models for parallel overhead compensation
  Account for interprocess dependencies
  Identified the need to communicate "delay"

Constructed on-the-fly algorithms based on the models
  Support message-passing parallelism
  Integrated into the TAU parallel profiling system

Validated parallel overhead compensation
  Master-worker application

Extend the techniques for semantics-based compensation
  Utilize knowledge of communication operation semantics

Page 28: Acknowledgements

Department of Energy (DOE), MICS office
  "Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System"
  "Performance Technology for Productive Parallel Computing"

NNSA/ASC, University of Utah
  DOE ASCI Level 1 sub-contract
  ASCI Level 3 project (LANL, LLNL, SNL)

URLs
  TAU: http://www.cs.uoregon.edu/research/tau
  PDT: http://www.cs.uoregon.edu/research/pdt