Early Experiences with KTAU on the Blue Gene/L
A. Nataraj, A. Malony, A. Morris, S. Shende
Performance Research Lab
University of Oregon
Outline: Introduction, Motivations, Objectives, Architecture, KTAU on Blue Gene/L, Ongoing/Recent Work, Future Work and Directions, Acknowledgements, References, Team
Introduction: ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation (FastOS): conduct OS research to provide an effective OS/runtime for petascale systems
- ZeptoOS (under FastOS): scalable components for petascale architectures
  - Joint project of Argonne National Lab and the University of Oregon
  - ANL: putting a light-weight kernel (based on Linux) on BG/L and other platforms (XT3)
  - University of Oregon: kernel performance monitoring and tuning (KTAU)
    - Integration of the TAU infrastructure into the Linux kernel
    - Integration with ZeptoOS, installation on BG/L
    - Ports to 32-bit and 64-bit Linux platforms
ZeptoOS: The Small Linux for Big Computers
- Research exploration: what are the fundamental limits and advanced designs required for petascale operating system suites?
- Behaviour at large scales
  - Management and optimization of OS suites
  - Collectives
  - Fault tolerance
  - Measurement, collection and analysis of OS performance data from a large number of nodes
- Strategy
  - Modified Linux on BG/L I/O nodes: measure and understand behavior
  - Modified Linux for BG/L compute nodes: measure and understand behavior
  - Specialized I/O daemon on the I/O node (ZOID): measure and understand behavior
(ZeptoOS BG/L Symposium presentation slide reused with permission from Pete Beckman [beckman06-bgl])
ZeptoOS and KTAU
- A lot of fine-grained OS measurement is required for each component of the ZeptoOS work:
  - Exactly what aspects of Linux need to be changed to achieve the ZeptoOS goals?
  - How and why do the various OS source and configuration changes affect parallel applications?
  - How do we correlate performance data between the parallel application, the compute-node OS, the I/O daemon and the I/O-node OS?
- Enter TAU/KTAU: an integrated methodology and framework to measure the performance of applications and the OS kernel across a system like BG/L.
Motivation
- Application performance = user-level execution performance + OS-level operations performance
- Domains: time and hardware performance metrics
  - PAPI (Performance Application Programming Interface): exposes virtualized hardware counters
  - TAU (Tuning and Analysis Utilities): measures most user-level entities (parallel application, MPI, libraries, ...) in the time domain; uses PAPI to correlate counter information to source
- But how do we correlate OS-level influences with application performance?
- As HPC systems continue to scale to larger processor counts:
  - Application performance becomes more sensitive
  - New OS factors become performance bottlenecks (e.g. [Petrini'03, Jones'03], among other works)
  - Isolating these system-level issues as bottlenecks is non-trivial
- Comprehensive performance understanding requires observing all performance factors, their relative contributions and their interrelationships. Can we correlate them?
Motivation (continued)
Motivation (continued): Program-OS Interactions
- Program-OS interactions: direct vs. indirect entry points
  - Direct: the application invokes the OS for certain services via syscalls (and internal OS routines called directly from syscalls)
  - Indirect: the OS takes actions without explicit invocation by the application: preemptive scheduling, (hardware) interrupt handling, OS background activity (keeping track of time and timers, bottom-half handling, etc.)
  - Indirect interactions can occur at any OS entry point, not just when entering through syscalls
- Direct interactions are easier to handle: they are synchronous with user code and run in process context
- Indirect interactions are more difficult: usually asynchronous and in interrupt context, they are hard to measure and harder to correlate/integrate with application measurements
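The direct/indirect split above can be illustrated with a small Python sketch (not KTAU code; the event fields are invented for illustration). Kernel time entered via a syscall in process context counts as a direct interaction of that process, while interrupt-context time is indirect and gets charged to whichever process happened to be interrupted:

```python
# Sketch: splitting per-process kernel time into direct (syscall,
# process-context) and indirect (interrupt-context) buckets.
# The (pid, context, duration) event format is hypothetical.
from collections import defaultdict

def attribute_kernel_time(events):
    """events: (pid, context, duration_us) tuples; context is
    'syscall' for process-context entries, 'interrupt' otherwise."""
    buckets = defaultdict(lambda: {"direct": 0, "indirect": 0})
    for pid, context, duration in events:
        kind = "direct" if context == "syscall" else "indirect"
        buckets[pid][kind] += duration
    return dict(buckets)

events = [
    (42, "syscall", 120),    # read() issued by pid 42
    (42, "interrupt", 30),   # timer interrupt landed while pid 42 ran
    (42, "syscall", 80),
    (7,  "interrupt", 15),   # pid 7 was merely the interrupted victim
]
print(attribute_kernel_time(events))
# pid 42: 200us direct, 30us indirect; pid 7: 0us direct, 15us indirect
```

Note that pid 7 never asked for kernel service: only a process-centric, context-aware view like this can show it paying for indirect OS activity.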
Motivation (continued): Kernel-wide vs. Process-centric
- Kernel-wide: aggregate kernel activity of all active processes in the system
  - Understand overall OS behavior; identify and remove kernel hot spots
  - Cannot show which parts of the application spend time in the OS, or why
- Process-centric: OS performance within the context of a specific application's execution
  - Virtualization, and mapping of performance to the process
  - Interactions between programs, daemons and system services
  - Tune the OS for a specific workload, or tune the application to better conform to the OS configuration
  - Expose the real source of performance problems (in the OS or in the application)
Motivation (continued): Existing Approaches
- User-space-only measurement tools: many tools work only at user level and cannot observe system-level performance influences
- Kernel-only measurement tools
  - Most provide only the kernel-wide perspective and lack proper mapping/virtualization
  - Some provide process-centric views but cannot integrate OS and user-level measurements
- Combined/integrated user/kernel measurement tools
  - A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance
  - These typically focus only on direct OS interactions; indirect interactions are not merged
- Combinations of the above tools: without better integration, they do not allow fine-grained correlation between the OS and the application, and many kernel tools do not explicitly recognize parallel workloads (e.g. MPI ranks)
- We need an integrated approach to parallel performance observation and analysis
High-Level Objectives
- Support low-overhead OS performance measurement at multiple levels of function and detail
- Provide both kernel-wide and process-centric perspectives of OS performance
- Merge user-level and kernel-level performance information across all program-OS interactions
- Provide online information, and the ability to function without a daemon where possible
- Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
- Leverage existing parallel performance analysis tools for observing, collecting and analyzing parallel data
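To make the profiling/tracing distinction in the objectives concrete: a trace is a timestamped stream of entry/exit events, while a profile aggregates them. A profile can always be derived from a trace by matching entries to exits with a stack. The sketch below is illustrative only (the event format is invented, not KTAU's):

```python
# Sketch: reducing a timestamped entry/exit trace to an inclusive-time
# profile. The tuple format here is hypothetical, not KTAU's format.
from collections import defaultdict

def trace_to_profile(trace):
    """trace: (timestamp, 'enter'|'exit', name) tuples, properly nested."""
    profile = defaultdict(lambda: {"calls": 0, "incl_time": 0})
    stack = []
    for ts, kind, name in trace:
        if kind == "enter":
            stack.append((name, ts))
        else:
            top, start = stack.pop()
            assert top == name, "malformed trace: mismatched exit"
            profile[name]["calls"] += 1
            profile[name]["incl_time"] += ts - start
    return dict(profile)

trace = [
    (0,  "enter", "sys_read"),
    (2,  "enter", "do_generic_file_read"),
    (9,  "exit",  "do_generic_file_read"),
    (10, "exit",  "sys_read"),
]
print(trace_to_profile(trace))
# sys_read: 1 call, 10 units inclusive; do_generic_file_read: 1 call, 7 units
```

The reverse is not possible: a profile cannot recover the timing and ordering a trace preserves, which is why both views are supported.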
KTAU: Outline: Introduction, Motivations, Objectives, Architecture, KTAU on Blue Gene/L, Recent/Ongoing Work (since publication), Future Work and Directions, Acknowledgements, References, Team
KTAU Architecture
KTAU on BG/L's ZeptoOS
- I/O node: open-source modified Linux kernel (2.4, 2.6), i.e. ZeptoOS; the Control and I/O Daemon (CIOD) handles I/O syscalls from the compute nodes in its pset
- Compute node: IBM proprietary (closed-source) light-weight kernel with no scheduling or virtual-memory support; forwards I/O syscalls to the CIOD on the I/O node
- KTAU on the I/O node: integrated into the ZeptoOS config and build system; requires KTAU-D (a daemon) because CIOD is closed-source; KTAU-D periodically monitors system-wide or per-process kernel performance data
- Visualization of ZeptoOS and CIOD traces/profiles using ParaProf, Vampir or Jumpshot
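Since CIOD cannot be instrumented directly, KTAU-D works from outside: it periodically snapshots kernel-side performance data and forwards it for visualization. The sketch below shows the shape of such a polling daemon in Python; the file path, record format, and function names are invented for illustration (the real KTAU-D reads KTAU's /proc interface, whose format may differ):

```python
# Sketch of a KTAU-D-style polling loop. The 'name count time_us' record
# format is hypothetical; the real daemon parses KTAU's /proc output.
import time

def poll_once(path):
    """Parse one snapshot of 'name count time_us' lines into a dict."""
    snapshot = {}
    with open(path) as f:
        for line in f:
            name, count, t = line.split()
            snapshot[name] = (int(count), int(t))
    return snapshot

def monitor(path, interval_s, rounds, sink):
    # Periodically snapshot kernel profile data and hand each snapshot
    # to a sink (e.g. a writer producing ParaProf/Vampir input files).
    for _ in range(rounds):
        sink(poll_once(path))
        time.sleep(interval_s)
```

Polling trades accuracy for intrusiveness: a shorter interval catches more short-lived behavior but perturbs the I/O node more, which matters when the node under observation is itself the object of the noise study.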
KTAU On BG/L (current)
On BG/L (continued)
Early Experiences: CIOD kernel trace, zoomed in (running the iotest benchmark)
On BG/L (continued): Early Experiences
On BG/L (continued): Early Experiences
Correlating CIOD and RPC-IOD Activity
KTAU on BG/L Will Eventually Look Like ...
- Compute node: replace the IBM light-weight kernel with Linux + KTAU
- I/O node: replace CIOD with ZOID + TAU
Ongoing/Recent Work (since publication)
- Accurate identification of "noise" sources
  - The modified Linux on BG/L should not take a performance loss
  - One area of concern: OS "noise" effects on synchronization/collectives
  - Requires identifying exactly which aspects of the OS (code paths, configurations, attached devices) induce which types of interference
  - This requires user-level as well as OS measurement
- Our approach
  - Use the Selfish benchmark [Beckman06] to identify "detours" (noise events) in user space; this shows the durations and frequencies of events, but NOT their cause/source
  - Simultaneously use KTAU OS tracing to record OS activity
  - Correlate times of occurrence (both use the same time source: the hardware time counter)
  - Infer which type of OS activity (if any) caused each "detour"
  - Remove or alleviate the interference using the above information (work in progress)
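The approach above can be sketched in a few lines of Python. This is an illustration of the idea, not the actual Selfish/KTAU tooling: the user-space benchmark reads the clock in a tight loop, so an unusually large gap between consecutive samples is a "detour"; the kernel trace then supplies candidate causes via interval overlap, since both streams share one clock:

```python
# Sketch of the correlation idea (not the actual Selfish/KTAU code):
# find "detours" as unusually large gaps between consecutive timestamps,
# then blame any kernel event whose interval overlaps the gap.

def find_detours(timestamps, threshold):
    """Gaps between consecutive samples larger than threshold are detours."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:])
            if b - a > threshold]

def correlate(detours, kernel_events):
    """kernel_events: (start, end, name) tuples on the same clock."""
    causes = []
    for d_start, d_end in detours:
        blamed = [name for start, end, name in kernel_events
                  if start < d_end and end > d_start]  # interval overlap
        causes.append(((d_start, d_end), blamed or ["unknown"]))
    return causes

# User-space loop sampled roughly every tick; a gap over 5 ticks is a detour.
samples = [0, 1, 2, 11, 12, 13]
kernel = [(2, 10, "timer_interrupt"), (40, 41, "sys_write")]
print(correlate(find_detours(samples, 5), kernel))
# the (2, 11) detour overlaps the timer interrupt, so it gets the blame
```

A detour with an empty overlap set ("unknown") is itself informative: it bounds how much noise the traced OS activity cannot explain.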
Ongoing/Recent Work (continued)
"Noise" Source Identification (BG/L I/O node): Merged OS/User Performance View of Scheduling
Ongoing/Recent Work (continued)
"Noise" Source Identification: Merged OS/User View of OS Background Activity
Ongoing/Recent Work (continued)
"Noise" Source Identification (zoomed in): Merged OS/User View of OS Background Activity
Future Work
- Dynamic measurement control: enable/disable events without recompilation or reboot
- Improve the performance data sources KTAU can access (e.g. PAPI)
- Improve integration with TAU's user-space capabilities to provide even better correlation of user and kernel performance information: full callpaths, phase-based profiling, merged user/kernel traces (already available)
- Integration of TAU and KTAU with Supermon
- Porting efforts: IA-64, PPC-64 and AMD Opteron
- ZeptoOS: planned characterization efforts for the BG/L I/O node and dynamically adaptive kernels
Acknowledgements
- Department of Energy Office of Science (contract no. DE-FG02-05ER25663)
- National Science Foundation (grant no. NSF CCF 0444475)
References
[Petrini'03]: F. Petrini, D. J. Kerbyson, and S. Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q," in SC '03.
[Jones'03]: T. Jones et al., "Improving the Scalability of Parallel Jobs by Adding Parallel Awareness to the Operating System," in SC '03.
[PAPI]: S. Browne et al., "A Portable Programming Interface for Performance Evaluation on Modern Processors," The International Journal of High Performance Computing Applications, 14(3):189-204, Fall 2000.
[VAMPIR]: W. E. Nagel et al., "VAMPIR: Visualization and Analysis of MPI Resources," Supercomputer, vol. 12, no. 1, pp. 69-80, 1996.
[ZeptoOS]: "ZeptoOS: The Small Linux for Big Computers," http://www.mcs.anl.gov/zeptoos/
[NPB]: D. H. Bailey et al., "The NAS Parallel Benchmarks," The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63-73, Fall 1991.
[Sweep3d]: A. Hoisie et al., "A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs," in International Conference on Parallel Processing, 2000.
[LMBENCH]: L. W. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," in USENIX Annual Technical Conference, 1996, pp. 279-294.
[TAU]: "TAU: Tuning and Analysis Utilities," http://www.cs.uoregon.edu/research/paracomp/tau/
[KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, "Early Experiences with KTAU on the IBM BG/L," in EuroPar '06, European Conference on Parallel Processing, 2006.
[KTAU]: A. Nataraj et al., "Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project," in IEEE Cluster 2006 (Best Paper).
Team
- University of Oregon (UO) core team
  - Aroon Nataraj, PhD student
  - Prof. Allen D. Malony
  - Dr. Sameer Shende, Senior Scientist
  - Alan Morris, Senior Software Engineer
- Argonne National Lab (ANL) contributors
  - Pete Beckman
  - Kamil Iskra
  - Kazutomo Yoshii
- Past members
  - Suravee Suthikulpanit, MS student, UO (graduated)
Thank You
Questions?
Comments?
Feedback?