aroon nataraj, alan morris, allen malony, matthew sottile, pete beckman l {anataraj, amorris,...

21
Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information Science University of Oregon [email protected] l Mathematics and Computer Science Division Argonne National Laboratory Observing the Effects of Kernel Operation on Parallel Application Performance

Upload: corey-webb

Post on 16-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckmanl

{anataraj, amorris, malony, matt}@cs.uoregon.edu

Department of Computer and Information Science

University of Oregon

[email protected]

Mathematics and Computer Science Division

Argonne National Laboratory

The Ghost in the Machine Observing the Effects of Kernel Operation on

Parallel Application Performance

Page 2: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

2Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Outline of the Talk Part A: Operating System Interference & Us

Introduction to OS/R Interference Research Questions & Our Contribution Contrasting to State-of-the-Art Noise Measurement

Part B The (K)TAU Performance System Fine-grained OS/App performance Correlation The Noise-Effect Estimation Technique Described

Part C Evaluating accuracy of Noise-Estimation @ Scale Demonstrating Noise-Estimation Conclusion & Future Plans

Page 3: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

3Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Parallel Applications’ Noise PropagationIntroduction to Operating System Interference

Waiting timedue to OS

Overheadaccumulates

Local Cycles: 3Global Cycles: 9

Compute

Compute

Compute

w

w

Col

lect

ive

Compute

Compute

Compute

w

w

Col

lect

ive

Compute

Compute

Compute

w

w

Col

lect

ive

os

os

os

Simple Parallel Application

Phases of Alternating Communication & Reduction

What is the cause of variability?

Page 4: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

4Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Is OS Noise a problem? When? Earlier work has shown significant OS/R interference problems

Petrini et. al. SC’03 ASCI Q, P=4,096 | SAGE took twice as long predicted by performance model

Jones et. al. SC’03, IBM SP | Degraded linear scaling performance of allreduce()

Cause identified as OS/R Interference Noise Distribution Matters

Agarwal et. al. HIPC’05 - Use theoretical approach Heavy-tailed & Bernoulli noise most detrimental

Beckman et. al. CCJ’07 - Noise Emulation (Injection) Max Noise duration, for some noise-ratio, determines noise effects on collectives

Our own noise-injection experiments show noise can be a problem

Introduction to Operating System Interference

Page 5: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

5Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Is OS Noise a problem? When?Introduction to Operating System Interference

Effect of Noise Duration on Allreduce()

P. Beckman, K. Iskra, K. Yoshii, S. Coghlan, A. Nataraj“Benchmarking the Effects of OS Interference on Extreme-Scale Parallel Machines”Cluster Computing Journal ’07, Special Issue from CLUSTER’06 (to appear).

Noise Ratio (or %Noise)Not good indicator by itself

Noise-event DurationSignificantly influences effect

=> Allreduce() tight-loop=> %Noise Fixed: 0.15%=> Increase Noise-Event Duration

Page 6: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

6Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Is OS Noise a problem? When?Introduction to Operating System Interference

Parallel Ocean Program (POP) - Effect of Noise at Scale

Yes. Noise does affect (some) real applications at scale.

Different parts of application have different sensitivities to noise

Total: 23% @ 1024 | 32% @ 2048

Baroclinic: 1% @ 1024 | 3% @ 2048

Barotropic: 75% @ 1024 | 86% @ 2048 POP, 1x Problem

BG/L, VN Mode

P = 1024, 2048

Noise: 1000HZ, 16 usec 1.6% Noise

Total =Baroclinic + BarotropicP

erce

nt I

ncre

ase

in R

unti

me

No. of Processors

Page 7: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

7Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Motivating a Direct Measurement Approach to Noise Estimation OS/R Interference of Parallel Applications is a complex phenomenon

Depends on the underlying platform (e.g. interconnect latencies, [*others?]) Depends on the OS/R configuration & resulting noise distributions Depends on parallel application behavior (comp. grain, global interactions)

Simulation, Emulation, Micro-benchmarking, Modeling [*what else?] All important parts of the performance tool-box But need to be carefully parameterized to match the above dependencies Not all complex parallel applications have been modeled And models/simulations can be too simple or wrongly parameterized Micro-benchmarks do not capture application complexity (intentionally)

What is App.X’s delay due to existing/intrinsic noise on Platform.Y? Need general measurement technique to answer that question Complementary to the other approaches in the tool-box

Introduction to Operating System Interference

Page 8: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

8Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Who cares about noise? Their concerns? System administrators / Site Managers:

What are noise characteristics of new tera/peta-scale platform we are purchasing?

Do the site’s “champion” applications suffer significant noise-slowdowns on it? Can these effects be mitigated by changes to configuration (OS/R, H/W)?

Application Developers: Is my parallel application sensitive to noise? On a particular platform? Can I change application characteristics to reduce that sensitivity? What parts and call-sequences are most sensitive?

OS/R Developers: What are the noise sources in the OS/R? What are the effects of different sources? Do all need to be removed?

Not all are comparative - cannot simply look at difference in runtime No “normal” (or baseline) noise-less candidate to compare with

Introduction to Operating System Interference

Page 9: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

9Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Our Key Contribution A general noise-effect estimation technique based on direct measurement To answer the performance questions

Q.1: Where is the OS/R noise on Platform.Y originating from (sources)? Q.2: What is App.X’s delay D, due to existing OS/R Interference on Platform.Y? Q.3: How do different OS/R noise sources contribute to delay D? Q.4: Which parts (operations, call-seq) of the applications are sensitive?

Applies to any X & Y, (as long as X & Y allow the needed direct measurement) Without requiring “baseline” or “noise-less” candidate. Why?

If “noiseless” candidate exists ==> then where is the noise-problem? A “baseline” will make answering Q.2 trivial. No special technique. Just

Runtime. Can only answer comparative questions - effect of adding/removing noise.

Wait a sec... Can’t existing work/methods already answer these questions? For Q.1, yes. But no general approach for Q.2 - Q.4. Lets look at some.

Introduction to Operating System Interference

Page 10: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

10

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Contrasting with the State-of-the-Art Petrini et. al. SC’03 - Micro-benchmark, simulation, modeling to close the loop

Very specific to application (SAGE). Accurate model difficult to produce. Gioiosa et. al. ISSIPIT’04 - Microbenchmark & measurement

Identify noise sources using OProfile sampling. Quantifies only local-noise. Agarwal et. al. HiPC’05 - Modeling

Theoretical model used to study noise effects under different distributions Assumptions include: Balanced Load, Stationary, Balanced Noise, Identical noise

Beckman et. al. CLUSTER’06 - Noise emulation Injected artificial noise into micro-benchmarks to understand effects Approach not applicable to measurement of real, existing noise

Sottile et. al. IPDPS’06 - Trace-based Noise Simulation Modify application traces by adding artificial noise. Compare effect on runtime.

Nataraj et. al. CLUSTER’06 / Europar’06 - Integrated Measurement of OS / Application using KTAU / TAU. Detect & Quantify Per-Process OS interactions.

Introduction to Operating System Interference

Page 11: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

11

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Part B

Page 12: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

12

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

The (K)TAU Performance System [* TODO] What is TAU? Features? Who uses it? What is KTAU? Why? ZeptoOS? Tight Integration with TAU - why?

Introduction to Operating System Interference

Page 13: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

13

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

KTAU OS Metrics - How it works...

schedule

timer_interrupt

schedule

timer_interrupt

sys_read

MPI Rank 0

Application

User

Kernel

KTAU PerformanceState

PID: 1423

User-level Double-Buffered Metric Container

On schedule() update counter

get_shared_counter()

14

schedule

timer_interrupt

schedule

timer_interrupt

sys_read

MPI Rank 1

Application

User

KTAU PerformanceState

PID: 1430

User-level Double-Buffered Metric Container

On timer_interrupt update counter

get_shared_counter()

sys_writedo_IRQ

Page 14: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

14

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Performance of KTAU OS Metrics & TAU MPI Tracing

Page 15: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

15

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Integrated User/OS ProfilingVirualized OS Metrics Per-MPI Rank | Detects & Quantifies OS/R Interactions

Page 16: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

16

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Integrated User/OS Phase Profiling [* TODO] Can attribute OS Noise to specific Phases in MPI Ranks Again Can detect & quantify, but at finer grain within phases. ToDo collect data on garuda (or neuronic) easy to do.

Page 17: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

17

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Noise-Effect Estimation Technique Outlined - Example A

timeMeasured Timeline

Approximated Timeline

Noise Amplification with Wait time

Process 1 Sends; Process 2 Receives

Noise-Effect Estimation Technique Outlined

Page 18: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

18

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Noise-Effect Estimation Technique Outlined - Example B

timeMeasured Timeline

Approximated Timeline

Noise Absorption with Wait time

Process 1 Sends; Process 2 Receives

Noise-Effect Estimation Technique Outlined

Page 19: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

19

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Noise-Effect Estimation Technique Outlined - Example C Multiple Sources - Combined Noise-Effect Have the pic. must place here. ToDo

Page 20: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

20

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

Prior Work on Trace-based Timeline Approximation Wolf, Malony, Shende, Morris | HPCC’06

Use timeline approximation from application traces. Context: Measurement perturbation compensation, not noise-effect estimation.

Sottile, Chandu, Bader | IPDPS’06 Trace-based simulation Inject artificial noise into existing application traces Reverse of our current work - we remove real noise from the trace.

Current work inspired by and builds upon the above two works

Page 21: Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony, matt}@cs.uoregon.edu Department of Computer and Information

21

Observing the Effects of Kernel Operation on Parallel Application PerformanceSC 2007

[* ToDo]

Part C