aroon nataraj, alan morris, allen malony, matthew sottile, pete beckman l {anataraj, amorris,...
Embed Size (px)
TRANSCRIPT

Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckmanl
{anataraj, amorris, malony, matt}@cs.uoregon.edu
Department of Computer and Information Science
University of Oregon
Mathematics and Computer Science Division
Argonne National Laboratory
The Ghost in the Machine Observing the Effects of Kernel Operation on
Parallel Application Performance

2Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Outline of the Talk Part A: Operating System Interference & Us
Introduction to OS/R Interference Research Questions & Our Contribution Contrasting to State-of-the-Art Noise Measurement
Part B The (K)TAU Performance System Fine-grained OS/App performance Correlation The Noise-Effect Estimation Technique Described
Part C Evaluating accuracy of Noise-Estimation @ Scale Demonstrating Noise-Estimation Conclusion & Future Plans

3Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Parallel Applications’ Noise PropagationIntroduction to Operating System Interference
Waiting timedue to OS
Overheadaccumulates
Local Cycles: 3Global Cycles: 9
Compute
Compute
Compute
w
w
Col
lect
ive
Compute
Compute
Compute
w
w
Col
lect
ive
Compute
Compute
Compute
w
w
Col
lect
ive
os
os
os
Simple Parallel Application
Phases of Alternating Communication & Reduction
What is the cause of variability?

4Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Is OS Noise a problem? When? Earlier work has shown significant OS/R interference problems
Petrini et. al. SC’03 ASCI Q, P=4,096 | SAGE took twice as long predicted by performance model
Jones et. al. SC’03, IBM SP | Degraded linear scaling performance of allreduce()
Cause identified as OS/R Interference Noise Distribution Matters
Agarwal et. al. HIPC’05 - Use theoretical approach Heavy-tailed & Bernoulli noise most detrimental
Beckman et. al. CCJ’07 - Noise Emulation (Injection) Max Noise duration, for some noise-ratio, determines noise effects on collectives
Our own noise-injection experiments show noise can be a problem
Introduction to Operating System Interference

5Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Is OS Noise a problem? When?Introduction to Operating System Interference
Effect of Noise Duration on Allreduce()
P. Beckman, K. Iskra, K. Yoshii, S. Coghlan, A. Nataraj“Benchmarking the Effects of OS Interference on Extreme-Scale Parallel Machines”Cluster Computing Journal ’07, Special Issue from CLUSTER’06 (to appear).
Noise Ratio (or %Noise)Not good indicator by itself
Noise-event DurationSignificantly influences effect
=> Allreduce() tight-loop=> %Noise Fixed: 0.15%=> Increase Noise-Event Duration

6Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Is OS Noise a problem? When?Introduction to Operating System Interference
Parallel Ocean Program (POP) - Effect of Noise at Scale
Yes. Noise does affect (some) real applications at scale.
Different parts of application have different sensitivities to noise
Total: 23% @ 1024 | 32% @ 2048
Baroclinic: 1% @ 1024 | 3% @ 2048
Barotropic: 75% @ 1024 | 86% @ 2048 POP, 1x Problem
BG/L, VN Mode
P = 1024, 2048
Noise: 1000HZ, 16 usec 1.6% Noise
Total =Baroclinic + BarotropicP
erce
nt I
ncre
ase
in R
unti
me
No. of Processors

7Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Motivating a Direct Measurement Approach to Noise Estimation OS/R Interference of Parallel Applications is a complex phenomenon
Depends on the underlying platform (e.g. interconnect latencies, [*others?]) Depends on the OS/R configuration & resulting noise distributions Depends on parallel application behavior (comp. grain, global interactions)
Simulation, Emulation, Micro-benchmarking, Modeling [*what else?] All important parts of the performance tool-box But need to be carefully parameterized to match the above dependencies Not all complex parallel applications have been modeled And models/simulations can be too simple or wrongly parameterized Micro-benchmarks do not capture application complexity (intentionally)
What is App.X’s delay due to existing/intrinsic noise on Platform.Y? Need general measurement technique to answer that question Complementary to the other approaches in the tool-box
Introduction to Operating System Interference

8Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Who cares about noise? Their concerns? System administrators / Site Managers:
What are noise characteristics of new tera/peta-scale platform we are purchasing?
Do the site’s “champion” applications suffer significant noise-slowdowns on it? Can these effects be mitigated by changes to configuration (OS/R, H/W)?
Application Developers: Is my parallel application sensitive to noise? On a particular platform? Can I change application characteristics to reduce that sensitivity? What parts and call-sequences are most sensitive?
OS/R Developers: What are the noise sources in the OS/R? What are the effects of different sources? Do all need to be removed?
Not all are comparative - cannot simply look at difference in runtime No “normal” (or baseline) noise-less candidate to compare with
Introduction to Operating System Interference

9Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Our Key Contribution A general noise-effect estimation technique based on direct measurement To answer the performance questions
Q.1: Where is the OS/R noise on Platform.Y originating from (sources)? Q.2: What is App.X’s delay D, due to existing OS/R Interference on Platform.Y? Q.3: How do different OS/R noise sources contribute to delay D? Q.4: Which parts (operations, call-seq) of the applications are sensitive?
Applies to any X & Y, (as long as X & Y allow the needed direct measurement) Without requiring “baseline” or “noise-less” candidate. Why?
If “noiseless” candidate exists ==> then where is the noise-problem? A “baseline” will make answering Q.2 trivial. No special technique. Just
Runtime. Can only answer comparative questions - effect of adding/removing noise.
Wait a sec... Can’t existing work/methods already answer these questions? For Q.1, yes. But no general approach for Q.2 - Q.4. Lets look at some.
Introduction to Operating System Interference

10
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Contrasting with the State-of-the-Art Petrini et. al. SC’03 - Micro-benchmark, simulation, modeling to close the loop
Very specific to application (SAGE). Accurate model difficult to produce. Gioiosa et. al. ISSIPIT’04 - Microbenchmark & measurement
Identify noise sources using OProfile sampling. Quantifies only local-noise. Agarwal et. al. HiPC’05 - Modeling
Theoretical model used to study noise effects under different distributions Assumptions include: Balanced Load, Stationary, Balanced Noise, Identical noise
Beckman et. al. CLUSTER’06 - Noise emulation Injected artificial noise into micro-benchmarks to understand effects Approach not applicable to measurement of real, existing noise
Sottile et. al. IPDPS’06 - Trace-based Noise Simulation Modify application traces by adding artificial noise. Compare effect on runtime.
Nataraj et. al. CLUSTER’06 / Europar’06 - Integrated Measurement of OS / Application using KTAU / TAU. Detect & Quantify Per-Process OS interactions.
Introduction to Operating System Interference

11
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Part B

12
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
The (K)TAU Performance System [* TODO] What is TAU? Features? Who uses it? What is KTAU? Why? ZeptoOS? Tight Integration with TAU - why?
Introduction to Operating System Interference

13
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
KTAU OS Metrics - How it works...
schedule
timer_interrupt
schedule
timer_interrupt
sys_read
MPI Rank 0
Application
User
Kernel
KTAU PerformanceState
PID: 1423
User-level Double-Buffered Metric Container
On schedule() update counter
get_shared_counter()
14
schedule
timer_interrupt
schedule
timer_interrupt
sys_read
MPI Rank 1
Application
User
KTAU PerformanceState
PID: 1430
User-level Double-Buffered Metric Container
On timer_interrupt update counter
get_shared_counter()
sys_writedo_IRQ

14
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Performance of KTAU OS Metrics & TAU MPI Tracing

15
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Integrated User/OS ProfilingVirualized OS Metrics Per-MPI Rank | Detects & Quantifies OS/R Interactions

16
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Integrated User/OS Phase Profiling [* TODO] Can attribute OS Noise to specific Phases in MPI Ranks Again Can detect & quantify, but at finer grain within phases. ToDo collect data on garuda (or neuronic) easy to do.

17
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Noise-Effect Estimation Technique Outlined - Example A
timeMeasured Timeline
Approximated Timeline
Noise Amplification with Wait time
Process 1 Sends; Process 2 Receives
Noise-Effect Estimation Technique Outlined

18
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Noise-Effect Estimation Technique Outlined - Example B
timeMeasured Timeline
Approximated Timeline
Noise Absorption with Wait time
Process 1 Sends; Process 2 Receives
Noise-Effect Estimation Technique Outlined

19
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Noise-Effect Estimation Technique Outlined - Example C Multiple Sources - Combined Noise-Effect Have the pic. must place here. ToDo

20
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
Prior Work on Trace-based Timeline Approximation Wolf, Malony, Shende, Morris | HPCC’06
Use timeline approximation from application traces. Context: Measurement perturbation compensation, not noise-effect estimation.
Sottile, Chandu, Bader | IPDPS’06 Trace-based simulation Inject artificial noise into existing application traces Reverse of our current work - we remove real noise from the trace.
Current work inspired by and builds upon the above two works

21
Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007
[* ToDo]
Part C