Performance Profiling Overhead Compensation for MPI Programs
Sameer Shende, Allen D. Malony, Alan Morris, Felix Wolf
{malony,sameer,amorris}@cs.uoregon.edu
Performance Research Laboratory, Department of Computer and Information Science, University of Oregon
Innovative Computing Laboratory, University of Tennessee
EuroPVM-MPI 2005
Outline
- Problem description
- Overhead modeling and compensation analysis
- Motivating example: master-worker test case
- Profiling and on-the-fly compensation models
- Schemes to piggyback delay in message passing: MPI implementation
- Demonstration of overhead compensation: Monte Carlo master-worker application profiling
- Conclusions
Empirical Parallel Performance Analysis
- Measurement-based analysis process:
  - performance observation (measurement)
  - performance diagnosis (finding / explaining problems)
- Profiling and tracing are the two main measurement methods
- Profiling computes summary statistics:
  - trades online analysis for less information
  - extra computation with less runtime data size
- Tracing captures time-sequenced records of events:
  - more analysis opportunities, including profile analysis
  - off-line analysis may be more complex
  - produces a larger volume of performance information
  - tracing is considered to be of "higher cost"
Overhead, Intrusion, and Perturbation
- All performance measurements generate overhead:
  - overhead is the cost of performance measurement
  - execution time, # instructions, memory references, ...
- Overhead causes (generates) performance intrusion:
  - intrusion is the dynamic performance effect of overhead
  - execution time slowdown, increased # instructions, ...
- Intrusion (potentially) causes performance perturbation:
  - perturbation is the change in performance behavior
  - alteration of "probable" performance behavior
  - measurement does not change "possible" executions
  - perturbation can lead to erroneous performance results
Performance Analysis Conundrum
- What is the "true" parallel computation performance?
  - some technique must be used to observe performance
  - performance measurement causes intrusion
  - any performance intrusion might result in perturbation
  - performance analysis is based on performance measures
- How is the "accuracy" of performance analysis evaluated?
  - how is this done when "true" performance is unknown?
- Uncertainty applies to all experimental methods:
  - "truth" lies just beyond the reach of observation
  - accuracy will be a relative assessment
Profiling Types
- Flat profiles: performance distributed onto the static program structure
- Path profiles: performance associated with program execution paths
- Events:
  - entry/exit (begin/end): change in metrics between events
  - atomic: current value of the metric at event occurrence
- Profile analysis:
  - inclusive statistics: performance of descendant events is included
  - exclusive statistics: performance of descendant events is not included
Profiling Strategy and Compensation Problem
- Advocate measured profiling as a method of choice
- Profiling intrusion reported as % slowdown in execution:
  - implicit assumption that overhead is equally distributed
- Profiling results may be distorted without compensation
- Parallel profiling results may show skewed results
- Is it possible to account for overhead effects?
- Is it possible to compensate for overhead?
- How are profiling analyses affected?
- What improvements in profiling accuracy result?
Overhead Compensation Methods
- Trace-based (Malony, Ph.D. thesis, PPoPP '91/'92):
  - overhead compensation in event-based execution replay (post-mortem and off-line)
  - analysis and repair of performance perturbations:
    - apply Lamport's "happened-before" relation
    - correct "errors" while maintaining partial-order dependencies
  - both profile and trace performance analysis possible
- Profile-based:
  - needs online compensation models
  - on-the-fly measurement and profile analysis algorithm
  - explicit process interaction required
Models for Overhead Compensation in Profiling
- Overhead compensation in profiling is a harder problem
- "Serial" compensation models (Euro-Par 2004):
  - compensate only local process overhead
  - do not take parallel dependencies into account
- Parallel compensation models (Euro-Par 2005):
  - interdependency of the effects of "overhead"
  - must track and exchange "delay" information
  - attempt to correct waiting time
  - cannot correct execution-order perturbations, but such perturbations can be identified
- Model implementation in MPI (EuroPVM-MPI 2005):
  - on-the-fly algorithm implemented in TAU
Motivating Example: Parallel Master-Worker

[Figure: timelines for master M and workers W1-W3, measured execution (top) vs. approximated execution after rational reconstruction (bottom), with waiting and overhead intervals marked along time t]

- Workers must communicate overhead to the master in order for the master to know when messages would have been received
- M encounters very little overhead, only at the beginning and end of the execution
Profiling and On-the-Fly Compensation Models
- Study parallel profile measurement cases:
  - rational explanation of the effects of measurement overhead
  - local (independent) events and measurement intrusion
  - interprocess (dependent) events and impact on interactions
- Reconstruct execution behavior without measurement:
  - use knowledge of events and overhead
  - remove overhead and recompute event timings
  - maintain interprocess dependencies
- Learn compensation algorithms from the reconstructed cases:
  - compare measured vs. approximated executions
  - study enough cases until a general solution appears
One-Message Scenario (Case 1)

[Figure: measured execution vs. approximated execution (rational reconstruction) for processes P1 and P2; P1 sends with accumulated overhead o1, P2 begins the receive, waits for time w, and ends the receive with accumulated overhead o2. Legend: S - send, Rb - receive begin, w - wait, Re - receive end]

- Case condition: o1 >= o2 + w
- The sending process must tell the other process how much earlier the message would have been sent
- Compensated values: w' = 0; o2' = o2 + w; x1 = o1; x2 = min(o1, o2 + w) = o2 + w
- x1 and x2 represent the "delay" of future events
- Overhead must absorb the erroneous waiting!
One-Message Scenario (Case 2)

[Figure: measured execution vs. approximated execution for P1 and P2, same events and legend as Case 1]

- Case condition: o1 < o2 + w
- Compensated values: w' = w + (o2 - o1); o2' = o2 - (o1 - o2) if o1 > o2; x1 = o1; x2 = min(o1, o2 + w) = o1
- x1 and x2 represent the "delay" of future events
- Waiting time may increase!
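As I read the two one-message scenarios, the receive-side compensation can be sketched in a few lines of C. This is an illustrative sketch only (the type and function names are invented, not TAU's), assuming the slide formulas for w' and x2 = min(o1, o2 + w):

```c
/* o1: accumulated overhead on the sending process P1
   o2: accumulated overhead on the receiving process P2
   w : measured waiting time at the receive
   Outputs: the compensated waiting time w' and x2, the "delay" of
   future events on P2 (x1 is simply o1). */
typedef struct { double w_prime; double x2; } comp_result;

comp_result compensate_receive(double o1, double o2, double w) {
    comp_result r;
    if (o1 >= o2 + w) {
        /* Case 1: the sender's overhead absorbs the erroneous waiting */
        r.w_prime = 0.0;
    } else {
        /* Case 2: waiting time may increase when the receiver carries
           more overhead than the sender */
        r.w_prime = w + (o2 - o1);
    }
    /* In both cases, x2 = min(o1, o2 + w) */
    r.x2 = (o1 < o2 + w) ? o1 : (o2 + w);
    return r;
}
```

With o1 = 5, o2 = 1, w = 2 this is Case 1 (w' = 0, x2 = 3); with o1 = 1, o2 = 2, w = 3 it is Case 2 (w' = 4, x2 = 1).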
General Algorithm
- Based on a generalization of the two-process model
- Update local overhead and delay based on measurement
- Update local overhead and delay based on messages:
  - received messages use only the delay values reported by the sender
- Process transmits its local delay with every send message
- Important to note that only the overhead value is used in profile calculations:
  - profile calculations subtract only the overhead
  - applies to both inclusive and exclusive performance calculations
- Implement the general algorithm in parallel profiling systems:
  - implemented in the TAU performance system
How Is the Local Delay Communicated?
- Sender must send its local delay in each message
- The problem is how to do this:
  - need to avoid adding extra intrusion
  - need to avoid further perturbing performance
- Goal:
  - provide a widely portable prototype
  - efficiently implemented and easily applied
- The capability is not currently available in the MPI library
- Build a prototype for MPI in TAU
MPI Implementation – Scheme 1
- Ideally the delay would be included in each send message
- Look for methods to "piggyback" the delay on messages
- Modify the source code of the underlying MPI implementation:
  - extend the message header in the communication substrate
  - approach taken by Photon
- Not portable to all MPI implementations
- Relies on a specially instrumented communication library
MPI Implementation – Scheme 2
- Send an additional message containing the delay information
- Done using the portable MPI wrapper interposition library (PMPI)
- Portable to all MPI implementations
- Performance penalty from the extra message transmission:
  - penalty not incurred in the first scheme
  - penalty is both overhead and perturbation
  - would require further compensation
- Delay information should be tied to the original message:
  - hard to guarantee a tight coupling
  - could lead to other problems if not
MPI Implementation – Scheme 3
- Copy the contents of the original message to a new message
- Create a new message header to include the delay information
- Send the new message and receive the new message
- Receiver must copy the original message to its destination
- Portability advantage of the second scheme:
  - could be implemented using PMPI
- Also avoids transmission of an additional message
- Copying message contents is expensive:
  - must be regarded as overhead and compensated
MPI Implementation – Scheme 4
- Problem with the third scheme is creating a new message:
  - needed because the delay information was put in the header
- Suppose we put the delay information in the message itself:
  - problem is that the original message cannot be modified
- Idea is to create a new "structured" datatype containing two members:
  - pointer to the original message buffer (the n elements of the datatype passed to the original MPI call)
  - a double-precision number containing the local delay
- Committed as a new user-defined datatype
- MPI is instructed to send or receive one element of this datatype
MPI Implementation – Scheme 4 (continued)
- Only one message is sent
- Avoids expensive copying of data buffers
- MPI decides internally how to transmit the message:
  - one option is to use vector read and write calls instead of their scalar counterparts
- Solution is portable to all MPI implementations:
  - structured datatypes are defined in the MPI standard
  - can be implemented using PMPI
  - relies on the MPI implementation for efficient transmission
- Must wrap each MPI API call
- Want to handle both synchronous and asynchronous calls
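The vector-transfer idea behind Scheme 4 can be illustrated without MPI: POSIX writev/readv transmit a payload and a trailing delay value in one operation over two non-contiguous buffers, which is analogous to what an MPI library may do internally for a user-defined structured datatype (in MPI proper this would be built with MPI_Type_create_struct). The sketch below is an analogue, not the actual TAU implementation; the function names are invented.

```c
#include <sys/uio.h>
#include <unistd.h>

/* Transmit the payload and the piggybacked delay as one "message"
   without copying them into a single buffer, via a two-element
   gather write over a file descriptor (e.g., a pipe). */
int send_with_delay(int fd, const double *payload, int n, double delay) {
    struct iovec iov[2];
    iov[0].iov_base = (void *)payload;   /* original message buffer */
    iov[0].iov_len  = (size_t)n * sizeof(double);
    iov[1].iov_base = &delay;            /* piggybacked local delay */
    iov[1].iov_len  = sizeof(double);
    return (int)writev(fd, iov, 2);
}

/* Scatter read: payload and delay land directly in their final
   locations, so no extra copy is needed on the receive side. */
int recv_with_delay(int fd, double *payload, int n, double *delay) {
    struct iovec iov[2];
    iov[0].iov_base = payload;
    iov[0].iov_len  = (size_t)n * sizeof(double);
    iov[1].iov_base = delay;
    iov[1].iov_len  = sizeof(double);
    return (int)readv(fd, iov, 2);
}
```

The design point mirrors the slide's argument: one transfer, zero copies, with the substrate (here the kernel, in TAU the MPI library) free to choose how the two segments move.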
Mapping MPI Calls – Send and Receive
- Synchronous (MPI_Send and MPI_Recv):
  - auto variable holding the delay value is allocated on the stack
- Asynchronous (MPI_Isend and MPI_Irecv):
  - sender implementation:
    - global variable in heap memory is used to store the delay value
    - location of the variable is used in the new datatype structure
  - receiver implementation:
    - received piggybacked delay is copied to a heap location
    - need to create a map linking the delay value to the MPI call
    - cannot put the calculations in MPI_Isend and MPI_Irecv; the message is only visible in the waiting or testing calls (MPI_Wait, MPI_Test, or variants)
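The request-to-delay map on the receiver side can be sketched as follows. This is an illustration, not TAU's data structure: a plain int stands in for MPI_Request, and the table size and function names are invented.

```c
/* The delay piggybacked on an MPI_Irecv message can only be consumed
   later, inside the MPI_Wait/MPI_Test wrapper, so it is parked in a
   small map keyed by the request handle until then. */
#define MAX_REQ 64

typedef struct { int request; double delay; int used; } req_entry;
static req_entry table[MAX_REQ];

/* Called when the piggybacked delay becomes known for a request. */
void remember_delay(int request, double delay) {
    for (int i = 0; i < MAX_REQ; i++)
        if (!table[i].used) {
            table[i].request = request;
            table[i].delay = delay;
            table[i].used = 1;
            return;
        }
}

/* Called from the wait/test wrapper; returns 1 and removes the entry
   if a delay was recorded for this request, 0 otherwise. */
int consume_delay(int request, double *delay) {
    for (int i = 0; i < MAX_REQ; i++)
        if (table[i].used && table[i].request == request) {
            *delay = table[i].delay;
            table[i].used = 0;
            return 1;
        }
    return 0;
}
```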
Mapping MPI Calls – Collective Operations
- Asynchronous MPI calls can be perturbed by overhead:
  - different receive order than without measurement
  - must maintain the receive order to ensure determinism
- We know more about collective operations
- Consider MPI_Gather:
  - extract all piggybacked delay values into an array
  - compute the minimum delay over the MPI communicator (effectively identifies the last process to arrive at the gather)
  - root's waiting time is adjusted based on this minimum delay
- Collective operations reduce to finding the minimum delay:
  - overhead is compensated accordingly
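The minimum-delay reduction at the heart of the collective mapping is simple; the sketch below computes it over an array of gathered delays. In the real implementation the delays would arrive at the root via the piggyback mechanism (e.g., through a PMPI_Gather); here they are just a plain array.

```c
/* Minimum piggybacked delay across a communicator of nprocs ranks;
   per the slides, this identifies the process that, after
   compensation, is the last to arrive, and the root's waiting time
   is adjusted against it. */
double min_delay(const double *delays, int nprocs) {
    double m = delays[0];
    for (int i = 1; i < nprocs; i++)
        if (delays[i] < m) m = delays[i];
    return m;
}
```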
Mapping MPI Calls – More Collective Operations
- MPI_Bcast reduces to a synchronous send/receive for each process in the MPI communicator
- MPI_Scatter behaves like MPI_Gather:
  - same as receiving a message from the root in all tasks
- MPI_Barrier implemented as a combination:
  - MPI_Gather of local delays to the root
  - root finds the minimum delay and adjusts its waiting time
  - root determines its local delay
  - MPI_Bcast to communicate the root's local delay
  - done for all processes in the communicator
- Efficiencies of the underlying MPI substrate are preserved
Compensated Parallel Master-Worker Scenario

[Figure: timelines for master M and workers W1-W3, measured execution vs. compensated execution, with waiting intervals marked]

- Must maintain receive event ordering in overhead compensation!
Master-Worker Overhead Compensation in TAU
- MPI program to compute π with Monte Carlo integration:
  - master generates work (a pair of random coordinates)
  - workers determine whether the coordinates fall above or below the π curve
  - iteratively estimate π to within a given range
- Four execution modes:
  - no instrumentation
  - full instrumentation, no compensation
  - full instrumentation, local-only compensation
  - full instrumentation, parallel compensation
Execution times per mode:

  Mode                                        Master     Worker
  No instrumentation                           73.926     73.834
  Full instrumentation, no compensation       128.179    128.173
  Full instrumentation, local compensation    139.56      73.212
  Full instrumentation, parallel compensation  74.126     73.909
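For context, the benchmark's computation can be sketched as a single-process Monte Carlo estimate of π using the standard quarter-circle test; the parallel version in the talk splits coordinate generation (master) and the in/out test (workers) across processes. The function below is a self-contained sketch, not the benchmark's code.

```c
#include <stdlib.h>

/* Estimate pi by sampling random points in the unit square and
   counting those that fall inside the quarter circle:
   inside/samples approximates pi/4. */
double estimate_pi(long samples, unsigned seed) {
    srand(seed);
    long inside = 0;
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) inside++;
    }
    return 4.0 * (double)inside / (double)samples;
}
```

Iterating with larger sample counts until the estimate stabilizes within a given range mirrors the application's termination criterion.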
Monte Carlo Integration of π – Profiles

- Compare TAU profiles for the different instrumentations
- Profile many application events to generate overhead

[Figure: four profile views, each showing M and W1-W3: MPI-only instrumentation (the reference execution); full instrumentation with no compensation (74% error); full instrumentation with local compensation (89% error in the master); full instrumentation with parallel compensation (1.0 - 1.4% error with all events)]
Conclusion
- Developed models for parallel overhead compensation:
  - account for interprocess dependencies
  - identified the need to communicate "delay"
- Constructed on-the-fly algorithms based on the models:
  - support message-passing parallelism
  - integrated in the TAU parallel profiling system
- Validated parallel overhead compensation:
  - master-worker application
- Extend the techniques for semantics-based compensation:
  - utilize knowledge of the communication operations
Acknowledgements
- Department of Energy (DOE), MICS office:
  - "Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System"
  - "Performance Technology for Productive Parallel Computing"
- NNSA/ASC:
  - University of Utah DOE ASCI Level 1 sub-contract
  - ASCI Level 3 project (LANL, LLNL, SNL)
- URLs:
  - TAU: http://www.cs.uoregon.edu/research/tau
  - PDT: http://www.cs.uoregon.edu/research/pdt