Early Experiences with KTAU on the Blue Gene / L A. Nataraj, A. Malony, A. Morris, S. Shende Performance Research Lab University of Oregon

Post on 05-Jan-2016

Early Experiences with KTAU on the Blue Gene / L

A. Nataraj, A. Malony, A. Morris, S. Shende

Performance Research Lab

University of Oregon

Outline
- Introduction
- Motivations
- Objectives
- Architecture
- KTAU on Blue Gene / L
- Ongoing / Recent work
- Future work and directions
- Acknowledgements
- References
- Team

Introduction: ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation (FastOS)
  - Conduct OS research to provide effective OS/runtime for petascale systems
- ZeptoOS (under FastOS)
  - Scalable components for petascale architectures
  - Joint project: Argonne National Lab and University of Oregon
  - ANL: putting a light-weight kernel (based on Linux) on BG/L and other platforms (XT3)
  - University of Oregon: kernel performance monitoring, tuning
- KTAU
  - Integration of TAU infrastructure in the Linux kernel
  - Integration with ZeptoOS, installation on BG/L
  - Port to 32-bit and 64-bit Linux platforms

ZeptoOS: The Small Linux for Big Computers
- Research exploration
  - What are the fundamental limits and advanced designs required for petascale operating system suites?
    - Behavior at large scales
    - Management & optimization of OS suites
    - Collectives
    - Fault tolerance
    - Measurement, collection and analysis of OS performance data from a large number of nodes
- Strategy
  - Modified Linux on BG/L I/O nodes
    - Measure and understand behavior
  - Modified Linux for BG/L compute nodes
    - Measure and understand behavior
  - Specialized I/O daemon on I/O node (ZOID)
    - Measure and understand behavior

(ZeptoOS BG/L Symposium presentation slide reused with permission from Pete Beckman [beckman06-bgl])

ZeptoOS and KTAU
- Lots of fine-grained OS measurement is required for each component of the ZeptoOS work
- Exactly what aspects of Linux need to be changed to achieve ZeptoOS goals?
- How and why do the various OS source and configuration changes affect parallel applications?
- How do we correlate performance data between the parallel application, the compute node OS, the I/O daemon and the I/O node OS?
- Enter TAU/KTAU: an integrated methodology and framework to measure performance of applications and the OS kernel across a system like BG/L

Motivation
- Application performance = user-level execution performance + OS-level operations performance
- Domains: time and hardware performance metrics
- PAPI (Performance Application Programming Interface)
  - Exposes virtualized hardware counters
- TAU (Tuning and Analysis Utilities)
  - Measures most user-level entities: parallel application, MPI, libraries ...
  - Time domain
  - Uses PAPI to correlate counter information to source
- But how to correlate OS-level influences with application performance?

Motivation (continued)
- As HPC systems continue to scale to larger processor counts
  - Application performance becomes more sensitive
  - New OS factors become performance bottlenecks (e.g. [Petrini’03, Jones’03, other works...])
  - Isolating these system-level issues as bottlenecks is non-trivial
- Require comprehensive performance understanding
  - Observation of all performance factors
  - Relative contributions and interrelationship
  - Can we correlate?

Motivation (continued): Program - OS Interactions
- Program-OS interactions: direct vs. indirect entry points
  - Direct: applications invoke the OS for certain services
    - Syscalls (and internal OS routines called directly from syscalls)
  - Indirect: OS takes actions without explicit invocation by the application
    - Preemptive scheduling
    - (HW) interrupt handling
    - OS-background activity (keeping track of time and timers, bottom-half handling, etc.)
    - Indirect interactions can occur at any OS entry (not just when entering through syscalls)
- Direct interactions are easier to handle
  - Synchronous with user code and in process context
- Indirect interactions are more difficult to handle
  - Usually asynchronous and in interrupt context: hard to measure and harder to correlate/integrate with application measurements

Motivation (continued): Kernel-wide vs. Process-centric
- Kernel-wide: aggregate kernel activity of all active processes in the system
  - Understand overall OS behavior, identify and remove kernel hot spots
  - Cannot show what parts of the application spend time in the OS and why
- Process-centric perspective: OS performance within the context of a specific application’s execution
  - Virtualization and mapping of performance to process
  - Interactions between programs, daemons, and system services
  - Tune the OS for a specific workload or tune the application to better conform to the OS configuration
  - Expose the real source of performance problems (in the OS or the application)
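The two perspectives can be illustrated with a small sketch. It assumes a hypothetical record format of (pid, kernel routine, seconds), which is not KTAU's actual data layout; the routine names and numbers are made up for illustration. The process-centric view keeps one profile per process, while the kernel-wide view sums over all processes and so loses the mapping back to a specific application.

```python
from collections import defaultdict

# Hypothetical samples: (pid, kernel routine, seconds spent). Illustrative only.
SAMPLES = [
    (101, "sys_read", 0.25),
    (101, "schedule", 0.5),
    (202, "sys_read", 1.0),
    (202, "do_IRQ", 0.25),
]


def process_centric(samples):
    """Per-process profile: pid -> routine -> total seconds."""
    view = defaultdict(lambda: defaultdict(float))
    for pid, routine, secs in samples:
        view[pid][routine] += secs
    return {pid: dict(routines) for pid, routines in view.items()}


def kernel_wide(samples):
    """Aggregate profile over all processes: routine -> total seconds."""
    view = defaultdict(float)
    for _pid, routine, secs in samples:
        view[routine] += secs
    return dict(view)
```

Here `kernel_wide(SAMPLES)` reports the total time in `sys_read` but cannot say which process incurred it; `process_centric(SAMPLES)` retains that attribution.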

Motivation (continued): Existing Approaches
- User-space-only measurement tools
  - Many tools only work at user level and cannot observe system-level performance influences
- Kernel-level-only measurement tools
  - Most only provide the kernel-wide perspective and lack proper mapping/virtualization
  - Some provide process-centric views but cannot integrate OS and user-level measurements
- Combined or integrated user/kernel measurement tools
  - A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance
  - Typically these focus only on direct OS interactions; indirect interactions are not merged
- Using combinations of the above tools
  - Without better integration, does not allow fine-grained correlation between OS and application
  - Many kernel tools do not explicitly recognize parallel workloads (e.g. MPI ranks)
- Need an integrated approach to parallel performance observation and analysis

High-Level Objectives
- Support low-overhead OS performance measurement at multiple levels of function and detail
- Provide both kernel-wide and process-centric perspectives of OS performance
- Merge user-level and kernel-level performance information across all program-OS interactions
- Provide online information and the ability to function without a daemon where possible
- Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
- Leverage existing parallel performance analysis tools
- Support for observing, collecting and analyzing parallel data
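As a minimal sketch of what merging user-level and kernel-level information can mean for traces, the following combines two time-sorted event streams into one timeline. The (timestamp, name) event format and the event names are assumptions for illustration, not TAU or KTAU trace formats, and both streams are assumed to share one clock.

```python
import heapq


def merge_traces(user_events, kernel_events):
    """Merge two time-sorted lists of (timestamp, name) into one timeline.

    Each output entry is (timestamp, level, name) with level "user" or "kernel".
    """
    user = ((ts, "user", name) for ts, name in user_events)
    kernel = ((ts, "kernel", name) for ts, name in kernel_events)
    return list(heapq.merge(user, kernel))


# Usage with made-up events: a syscall nested inside an MPI call.
timeline = merge_traces(
    [(1, "MPI_Send enter"), (5, "MPI_Send exit")],
    [(2, "sys_write enter"), (4, "sys_write exit")],
)
```

In the merged timeline the kernel events appear between the enter and exit of the user-level routine that triggered them.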

KTAU: Outline
- Introduction
- Motivations
- Objectives
- Architecture
- KTAU on Blue Gene / L
- Recent/Ongoing Work (since publication)
- Future work and directions
- Acknowledgements
- References
- Team

KTAU Architecture

KTAU on BG/L’s ZeptoOS
- I/O node
  - Open-source modified Linux kernel (2.4, 2.6) - ZeptoOS
  - Control I/O Daemon (CIOD) handles I/O syscalls from compute nodes in pset
- Compute node
  - IBM proprietary (closed-source) light-weight kernel
  - No scheduling or virtual memory support
  - Forwards I/O syscalls to CIOD on I/O node
- KTAU on I/O node
  - Integrated into ZeptoOS config and build system
  - Requires KTAU-D (daemon) as CIOD is closed-source
  - KTAU-D periodically monitors system-wide or individual processes
  - Visualization of trace/profile of ZeptoOS, CIOD using ParaProf, Vampir/Jumpshot
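The periodic monitoring done by a daemon such as KTAU-D can be sketched as a polling loop over a cumulative profile. This is an illustration under stated assumptions, not KTAU-D's implementation: `read_profile` is a hypothetical callable standing in for whatever interface the daemon actually uses to read kernel performance data, and the dictionary format is invented.

```python
import time


def monitor(read_profile, interval_s, rounds, sleep=time.sleep):
    """Poll a cumulative profile and return the per-interval deltas.

    read_profile: hypothetical callable returning {routine: cumulative seconds}.
    Only routines whose cumulative time changed appear in each delta.
    """
    deltas = []
    previous = read_profile()
    for _ in range(rounds):
        sleep(interval_s)
        current = read_profile()
        delta = {}
        for name, total in current.items():
            change = total - previous.get(name, 0.0)
            if change != 0.0:
                delta[name] = change
        deltas.append(delta)
        previous = current
    return deltas
```

Reporting deltas rather than cumulative totals shows what the kernel did during each interval, which is the information an external observer needs when the monitored process itself (here, the closed-source CIOD) cannot be instrumented.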

KTAU On BG/L (current)

On BG/L (continued): Early Experiences

CIOD kernel trace, zoomed in (running iotest benchmark)

On BG/L (continued): Early Experiences

On BG/L (continued): Early Experiences

Correlating CIOD and RPC-IOD Activity

KTAU On BG/L Will Eventually Look Like …

Replace with: ZOID + TAU

Replace with: Linux + KTAU

Ongoing/Recent Work (since publication)
- Accurate identification of “noise” sources
  - Modified Linux on BG/L should not take a performance loss
  - One area of concern: OS “noise” effects on synchronization / collectives
  - Requires identifying exactly what aspects (code paths, configurations, devices attached) of the OS induce what types of interference
  - This will require user-level as well as OS measurement
- Our approach
  - Use the Selfish benchmark [Beckman06] to identify “detours” (or noise events) in user space
    - This shows durations and frequencies of events, but NOT cause/source
  - Simultaneously use KTAU OS tracing to record OS activity
  - Correlate time of occurrence (both use the same time source: the hardware time counter)
  - Infer which type of OS activity (if any) caused the “detour”
  - Remove or alleviate interference using the above information (work in progress)
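The approach above can be sketched in a few lines. This is an illustration under assumptions, not the Selfish benchmark or KTAU code: detours are taken to be gaps between consecutive timestamps of a tight user-space loop that exceed a threshold, kernel trace events are assumed to be (start, end, name) intervals, and both are assumed to use the same clock. All names and numbers are hypothetical.

```python
def find_detours(samples, threshold):
    """Return (start, end) gaps between consecutive timestamps above threshold.

    samples: timestamps recorded by a tight user-space loop, in increasing order.
    """
    return [(a, b) for a, b in zip(samples, samples[1:]) if b - a > threshold]


def attribute(detours, kernel_events):
    """Map each detour to the names of kernel events overlapping it in time.

    kernel_events: (start, end, name) intervals on the same clock as the detours.
    An empty list means no recorded OS activity explains that detour.
    """
    result = {}
    for d_start, d_end in detours:
        result[(d_start, d_end)] = [
            name
            for k_start, k_end, name in kernel_events
            if k_start < d_end and k_end > d_start  # intervals overlap
        ]
    return result


# Usage with made-up timestamps (arbitrary time-counter units).
detours = find_detours([0, 10, 20, 95, 105, 115, 400, 410], threshold=50)
causes = attribute(
    detours,
    [(30, 80, "timer_interrupt"), (150, 380, "schedule"), (500, 520, "do_IRQ")],
)
```

The shared time source is what makes the overlap test meaningful; with separate clocks the two event streams would first need to be aligned.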

Ongoing/Recent Work (continued): “Noise” Source Identification

BG/L I/O node: merged OS/user performance view of scheduling

Ongoing/Recent Work (continued): “Noise” Source Identification

Merged OS/user view of OS background activity

Ongoing/Recent Work (continued): “Noise” Source Identification

Zoomed in: merged OS/user view of OS background activity

Future Work
- Dynamic measurement control: enable/disable events without recompilation or reboot
- Improve performance data sources that KTAU can access, e.g. PAPI
- Improve integration with TAU’s user-space capabilities to provide even better correlation of user and kernel performance information
  - Full callpaths, phase-based profiling, merged user/kernel traces (already available)
- Integration of TAU, KTAU with Supermon
- Porting efforts: IA-64, PPC-64 and AMD Opteron
- ZeptoOS: planned characterization efforts
  - BG/L I/O node
  - Dynamically adaptive kernels

Support Acknowledgements

Department of Energy’s Office of Science (contract no. DE-FG02-05ER25663) and

National Science Foundation (grant no. NSF CCF 0444475)

References

[Petrini’03]: F. Petrini, D. J. Kerbyson, and S. Pakin, “The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q,” in SC ’03.

[Jones’03]: T. Jones et al., “Improving the scalability of parallel jobs by adding parallel awareness to the operating system,” in SC ’03.

[PAPI]: S. Browne et al., “A portable programming interface for performance evaluation on modern processors,” The International Journal of High Performance Computing Applications, vol. 14, no. 3, pp. 189–204, Fall 2000.

[VAMPIR]: W. E. Nagel et al., “VAMPIR: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, no. 1, pp. 69–80, 1996.

[ZeptoOS]: “ZeptoOS: The small Linux for big computers,” http://www.mcs.anl.gov/zeptoos/

[NPB]: D. H. Bailey et al., “The NAS parallel benchmarks,” The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991.

[Sweep3d]: A. Hoisie et al., “A general predictive performance model for wavefront algorithms on clusters of SMPs,” in International Conference on Parallel Processing, 2000.

[LMBENCH]: L. W. McVoy and C. Staelin, “lmbench: Portable tools for performance analysis,” in USENIX Annual Technical Conference, 1996, pp. 279–294.

[TAU]: “TAU: Tuning and Analysis Utilities,” http://www.cs.uoregon.edu/research/paracomp/tau/

[KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, “Early experiences with KTAU on the IBM BG/L,” in EuroPar’06, European Conference on Parallel Processing, 2006.

[KTAU]: A. Nataraj et al., “Kernel-level measurement for integrated parallel performance views: the KTAU project,” in IEEE Cluster-2006 (Best Paper).

Team
- University of Oregon (UO) core team
  - Aroon Nataraj, PhD student
  - Prof. Allen D. Malony
  - Dr. Sameer Shende, Senior Scientist
  - Alan Morris, Senior Software Engineer
- Argonne National Lab (ANL) contributors
  - Pete Beckman
  - Kamil Iskra
  - Kazutomo Yoshii
- Past members
  - Suravee Suthikulpanit, MS student, UO (graduated)

Thank You

Questions?

Comments?

Feedback?