TRANSCRIPT
FAST-OS BOF SC 04
http://www.cs.unm.edu/~fastos
Follow the link to subscribe to the mailing list
Projects
• Colony – Terry Jones, LLNL
• Config Framework – Ron Brightwell, SNL
• DAiSES – Pat Teller, UTEP
• K42 – Paul Hargrove, LBNL
• MOLAR – Stephen Scott, ORNL
• Peta-Scale SSI – Scott Studham, ORNL
• Rightweight Kernels – Ron Minnich, LANL
• Scalable FT – Jarek Nieplocha, PNNL
• SmartApps – L. Rauchwerger, Texas A&M
• ZeptoOS – Pete Beckman, ANL
www.HPC-Colony.org
Services & Interfaces For Very Large Linux Clusters
Terry Jones, LLNL, Coordinating PI; Laxmikant Kale, UIUC, PI;
Jose Moreira, IBM, PI; Celso Mendes, UIUC; Derek Lieber, IBM
Overview
Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors
Collaborators:
• Lawrence Livermore National Laboratory
• University of Illinois at Urbana-Champaign
• International Business Machines
Topics:
• Parallel Resource Instrumentation Framework
• Scalable Load Balancing
• OS mechanisms for Migration
• Processor Virtualization for Fault Tolerance
• Single system management space
• Parallel Awareness and Coordinated Scheduling of Services
• Linux OS for cellular architecture
Colony
Motivation
• Parallel resource management
Strategies for scheduling and load balancing must be improved. Difficulties in achieving a balanced partitioning and dynamically scheduling workloads can limit scaling for complex problems on large machines.
• Global system management
System management is inadequate. Parallel jobs require common operating system services, such as process scheduling, event notification, and job management to scale to large machines.
Colony
Goals
• Develop infrastructure and strategies for automated parallel resource management
– Today, application programmers must explicitly manage these resources. We address scaling issues and porting issues by delegating resource management tasks to a sophisticated parallel OS.
– “Managing Resources” includes balancing CPU time, network utilization, and memory usage across the entire machine.
• Develop a set of services to enhance the OS to improve its ability to support systems with very large numbers of processors
– We will improve operating system awareness of the requirements of parallel applications.
– We will enhance operating system support for parallel execution by providing coordinated scheduling and improved management services for very large machines.
Colony
Approach
• Top Down
– Our work will start from an existing full-featured OS and remove excess baggage with a "top down" approach.
• Processor Virtualization
– One of our core techniques: the programmer divides the computation into a large number of entities, which are mapped to the available processors by an intelligent runtime system (a minimal sketch follows this slide).
• Leverage Advantages of Full-Featured OS & Single System Image
– Applications on these extreme-scale systems will benefit from extensive services and interfaces; managing these complex systems will require an improved "logical view"
• Utilize Blue Gene
– Suitable platform for ideas intended for very large numbers of processors
Colony
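The slides name processor virtualization but do not show the runtime itself (Colony builds on Charm++/Adaptive MPI machinery). Below is a minimal, hedged C sketch of the over-decomposition idea only: far more work entities than processors, with a trivial "runtime" (a shared queue drained by pthreads workers) doing the mapping. All names (chunk_t, NCHUNKS, worker) are invented for illustration.

```c
/* Hedged illustration only: over-decomposition in the processor
 * virtualization style; not Colony's actual runtime. */
#include <pthread.h>
#include <stdio.h>

#define NCHUNKS 1024            /* many more entities than processors */
#define NWORKERS 4              /* stand-in for the physical processors */

typedef struct { int id; double result; } chunk_t;

static chunk_t chunks[NCHUNKS];
static int next_chunk = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void compute(chunk_t *c) {
    c->result = c->id * 0.5;    /* stand-in for real, variable-cost work */
}

/* Workers pull the next unassigned entity, so the mapping of entities
 * to processors is the runtime's job, not the programmer's. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int i = (next_chunk < NCHUNKS) ? next_chunk++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0)
            return NULL;
        compute(&chunks[i]);
    }
}

int main(void) {
    pthread_t t[NWORKERS];
    for (int i = 0; i < NCHUNKS; i++)
        chunks[i].id = i;
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    printf("computed %d chunks on %d workers\n", NCHUNKS, NWORKERS);
    return 0;
}
```

Because workers pull entities on demand, a slow or interrupted processor simply completes fewer chunks; the balance emerges from the mapping, which is the property an intelligent runtime exploits at much larger scale.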
Configurable OS Framework
• Sandia (lead) – Ron Brightwell (PI), Rolf Riesen
• Caltech – Thomas Sterling (PI)
• UNM – Barney Maccabe (PI), Patrick Bridges
Issues
• Novel architectures – lots of execution environments
• Programming models – MPI, UPC, separating processing from location
• Shared services – file systems, shared WAN
• Usage model – dedicated, space shared, time shared
Approach
• Build application-specific OS
– Architecture, programming model, shared resources, usage model
• Develop a collection of micro services
– Compose and distribute
• Compose services
– Services may adapt
• Kinds of services (see the sketch below)
– Memory allocation, signal delivery, message receipt and handler activation
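The slides leave the micro-service interfaces abstract, so the following is only a hedged C sketch of the composition idea: each micro service is an entry in a table, and an application-specific OS image is just the particular table you choose to assemble. Every name here (micro_service, os_image, the two services) is invented.

```c
/* Hedged illustration only: composing an application-specific OS image
 * from a table of micro services; not the project's real framework. */
#include <stdio.h>

typedef struct micro_service {
    const char *name;
    int  (*init)(void);            /* bring the service up */
    void (*handle_event)(int ev);  /* e.g. signal delivery, message receipt */
} micro_service;

static int  mem_init(void)     { puts("mem: simple region allocator"); return 0; }
static void mem_event(int ev)  { (void)ev; }
static int  msg_init(void)     { puts("msg: receipt + handler activation"); return 0; }
static void msg_event(int ev)  { printf("msg: activate handler for event %d\n", ev); }

/* The "composition" step: pick exactly the services this application,
 * programming model, and usage model need, and nothing else. */
static micro_service os_image[] = {
    { "memory-allocation", mem_init, mem_event },
    { "message-receipt",   msg_init, msg_event },
};

int main(void) {
    for (unsigned i = 0; i < sizeof os_image / sizeof os_image[0]; i++)
        if (os_image[i].init() != 0)
            return 1;
    os_image[1].handle_event(42);   /* deliver one event through the stack */
    return 0;
}
```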
The Picture (architecture figure)
Challenges
• How to reason about combinations
• Dependencies among services
• Efficiency
– Overhead associated with transfers between micro services
• How many operating systems will we really need?
(Figure: generalized → customized resource management; fixed → dynamically adaptable OS/runtime services; yielding enhanced performance.)
Goals – Dynamic Adaptability in Support of Extreme Scale
Determining
• What to adapt
• When to adapt
• How to adapt
• How to measure effects of adaptation
Challenges – Dynamic Adaptability in Support of Extreme Scale
• Develop mechanisms to dynamically sense, analyze, and adjust common performance metrics, fluctuating workload situations, and overall system environment conditions
• Demonstrate, via Linux prototypes and experiments, dynamic self-tuning/provisioning in HPC environments
• Develop a methodology for general-purpose OS adaptation
Deliverables – Dynamic Adaptability in Support of Extreme Scale
Methodology
(Workflow figure over the potential adaptation targets, spanning off-line and off-line/run-time phases:)
• identify adaptation targets
• characterize workload resource usage patterns
• (re)determine adaptation intervals
• define/adapt heuristics to trigger adaptation
• generate/adapt monitoring, triggering, and adaptation code, and attach it to the OS
• KernInst: monitor application execution, triggering adaptation as necessary
Dynamic Adaptability in Support of Extreme Scale
KernInst
(Architecture figure: an instrumentation tool client uses the KernInst API to reach the KernInst daemon, which drives the KernInst device inside the Linux kernel, here on an IBM pSeries eServer 690.)
• University of Wisconsin's KernInst and Kperfmon provide dynamic instrumentation of the kernel: the capability to perform dynamic monitoring and adaptation of commodity operating systems, making run-time monitoring and adaptation far more tractable.
Dynamic Adaptability in Support of Extreme Scale
Customization of:
• process scheduling parameters and algorithms, e.g., scheduling policy for different job types (prototype in process)
• file system cache size and management
• disk cache management
• size of OS buffers and tables
• I/O, e.g., checkpoint/restart
• memory allocation and management parameters and algorithms
Example Adaptations – Dynamic Adaptability in Support of Extreme Scale
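None of the DAiSES/KernInst machinery is spelled out in these slides, so as a rough illustration of the sense-analyze-adjust loop behind such adaptations, here is a hedged user-space C sketch that tunes a single Linux knob (vm.swappiness via /proc) on a fixed interval. KernInst would instead splice the monitoring and adaptation code into the running kernel; the 10% pressure threshold and the 80/40 values below are arbitrary.

```c
/* Hedged illustration only: a user-space monitor-trigger-adapt loop
 * over one Linux tunable; not the DAiSES/KernInst mechanism itself. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long meminfo_kb(const char *key) {           /* e.g. "MemFree:" */
    char line[128];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

static void set_swappiness(int v) {                 /* needs root */
    FILE *f = fopen("/proc/sys/vm/swappiness", "w");
    if (f) { fprintf(f, "%d\n", v); fclose(f); }
}

int main(void) {
    for (;;) {
        long total = meminfo_kb("MemTotal:");
        long freek = meminfo_kb("MemFree:");
        if (total > 0 && freek >= 0) {
            /* trigger heuristic (arbitrary numbers): under memory
             * pressure, reclaim more aggressively */
            set_swappiness(freek * 100 / total < 10 ? 80 : 40);
        }
        sleep(5);   /* the (re)determined adaptation interval */
    }
}
```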
University of Texas at El Paso, Department of Computer Science – Patricia J. Teller ([email protected])
University of Wisconsin–Madison, Computer Sciences Department – Barton P. Miller ([email protected])
International Business Machines, Inc., Linux Technology Center – Bill Buros ([email protected])
U.S. Department of Energy, Office of Science – Fred Johnson ([email protected])
Partners – Dynamic Adaptability in Support of Extreme Scale
High End Computing with K42
Computational Research Division
Paul H. Hargrove and Katherine Yelick, Lawrence Berkeley National Lab
Angela Demke Brown and Michael Stumm, University of Toronto
Patrick Bridges, University of New Mexico
Orran Krieger and Dilma Da Silva, IBM
Project Motivation
• The HECRTF and FastOS reports enumerate unmet needs in the area of Operating Systems for HEC, including:
– Availability of Research Frameworks
– Support for Architectural Innovation
– Performance Visibility
– Ease of Use
– Adaptability to Application Requirements
• This project uses the K42 Operating System to address these five needs
K42
K42 Background
• K42 is a research OS from IBM
– API/ABI compatibility with Linux
– Designed for large 64-bit SMPs
– Extensible object-oriented design
• features per-resource-instance objects
• can change implementation/policy for individual instances at runtime (a toy sketch follows this slide)
– Extensive performance monitoring
– Many traditional OS functions are performed in user-space libraries
K42
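K42's hot-swappable objects are C++ classes with a careful swap protocol; the slides only name the capability. As a toy illustration of per-instance implementation replacement, here is a hedged C sketch that reduces the idea to a swappable function-pointer table. All names (file_ops, read_cached, read_streaming) are invented.

```c
/* Hedged illustration only: the flavor of K42's per-instance,
 * runtime-replaceable objects, reduced to a function-pointer table.
 * K42's real generation-based swap protocol is not shown. */
#include <stdio.h>

typedef struct file_ops { long (*read)(void *self, long n); } file_ops;

/* Two interchangeable implementations of the same resource object. */
static long read_cached(void *self, long n)    { (void)self; puts("cached path");    return n; }
static long read_streaming(void *self, long n) { (void)self; puts("streaming path"); return n; }

static file_ops cached_ops    = { read_cached };
static file_ops streaming_ops = { read_streaming };

typedef struct file_obj { file_ops *ops; /* per-instance, swappable */ } file_obj;

int main(void) {
    file_obj f = { &cached_ops };
    f.ops->read(&f, 4096);
    /* Runtime replacement for THIS instance only: e.g. a monitor
     * notices a streaming access pattern and swaps the policy. */
    f.ops = &streaming_ops;
    f.ops->read(&f, 4096);
    return 0;
}
```

The point mirrored from K42: the swap changes policy for one instance only, so a monitor can retune a single hot file or socket without touching the rest of the system.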
What Work Remains? (1 of 2)
• Availability of Research Frameworks & Support for Architectural Innovation
– K42 is already a research platform, used by IBM for its PERCS project (DARPA HPCS) to support architectural innovation
– Work remains to expand K42 from SMPs to clusters
• Performance Visibility
– Existing facilities are quite extensive
– Work remains to use runtime replacement of object implementations to monitor single objects for fine-grained control
K42
What Work Remains? (2 of 2)
• Ease of Use
– Work remains to make K42 widely available, and to bring HEC user environments to K42 (e.g., MPI, batch systems, etc.)
• Adaptability to Application Requirements
– Runtime replacement of object implementations provides extreme customizability
– Work remains to provide implementations appropriate to HEC, and to perform automatic dynamic adaptation
K42
MOLAR: Modular Linux and Adaptive Runtime Support for High-end Computing Operating and Runtime Systems
Coordinating Principal Investigator
Stephen L. Scott, ORNL
Principal Investigators
J. Vetter, D.E. Bernholdt, C. Engelmann – ORNL
C. Leangsuksun – Louisiana Tech University
P. Sadayappan – Ohio State University
F. Mueller – North Carolina State University
Collaborators
A.B. Maccabe – University of New Mexico
C. Nuss, D. Mason – Cray Inc.
Oak Ridge National Laboratory – U.S. Department of Energy
MOLAR research goals
• Create a modular and configurable Linux system that allows customized changes based on the requirements of the applications, runtime systems, and cluster management software.
• Build runtime systems that leverage the OS modularity and configurability to improve efficiency, reliability, scalability, ease-of-use, and provide support to legacy and promising programming models.
• Advance computer reliability, availability and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues.
• Explore the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions.
MOLAR
High-end Computing OS Research Map – MOLAR: Modular Linux and Adaptive Runtime support
(Research map figure: an HEC Linux OS – modular, custom, light-weight – surrounded by RAS (high availability, monitoring, root cause analysis), kernel design, performance observation, communications and I/O, and testbeds; the map distinguishes provided components from runtime/OS pieces to extend or adapt.)
PROBLEM:
• Current OSs and runtime systems (OS/R) are unable to meet the various requirements to run large applications efficiently on future ultra-scale computers.
GOALS:
• Development of a modular and configurable Linux framework.
• Runtime systems to provide seamless coordination between system levels.
• Monitoring and adaptation of the operating system, runtime, and applications.
• Reliability, availability, and serviceability (RAS).
• Efficient system management tools.
IMPACT:
• Enhanced support and better understanding of extremely scalable architectures.
• Proof-of-concept implementation open to community researchers.
MOLAR
MOLAR crosscut capability deployed for RAS
• Monitoring core daemon
– service monitor
– resource monitor
– hardware health monitor
• Head nodes: active / hot standby
• Services: active / hot standby
• Modular Linux systems deployment & development
MOLAR
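The slides do not define the monitoring protocol behind the active/hot-standby head nodes, so the following is only a hedged C sketch of that generic pattern: the standby listens for UDP heartbeats from the active head node and promotes itself after several missed beats. The port number, interval, and threshold are all invented.

```c
/* Hedged illustration only: a generic active/hot-standby heartbeat
 * monitor; not MOLAR's monitoring core daemon. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/select.h>

int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_ANY);
    a.sin_port = htons(7700);               /* heartbeat port (made up) */
    bind(s, (struct sockaddr *)&a, sizeof a);

    int missed = 0;
    while (missed < 3) {                    /* three missed beats = failover */
        fd_set r;
        FD_ZERO(&r);
        FD_SET(s, &r);
        struct timeval tv = { 2, 0 };       /* beats expected every 2 s */
        if (select(s + 1, &r, NULL, NULL, &tv) > 0) {
            char buf[64];
            recv(s, buf, sizeof buf, 0);
            missed = 0;                     /* active head node is alive */
        } else {
            missed++;
        }
    }
    puts("active head node silent: standby promotes itself");
    /* real system: acquire the service IP, replay state, restart services */
    return 0;
}
```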
MOLAR Federated System Management (fSM)
• fSM emphasizes simplicity
– self-build
– self-configuration
– self-healing
– simplified operation
• Expand MOLAR support:
– investigate specialized architectures
– investigate other environments & OSs
• Head nodes: active / active
• Services: active / active
MOLAR
Peta-Scale Single-System Image: A framework for a single-system image Linux environment for 100,000+ processors and multiple architectures
Coordinating Investigator
R. Scott Studham, ORNL
Principal Investigators
Alan Cox, Rice University
Bruce Walker, HP
Investigators
Peter Druschel, Rice University
Scott Rixner, Rice University
Collaborators
Peter Braam, CFS
Steve Reinhardt, SGI
Stephen Wheat, Intel
Project Key Objectives
• OpenSSI to 10,000 nodes
• Integration of OpenSSI with nodes with high processor counts
• The scalability of a shared root filesystem to 10,000 nodes
• Scalable booting and monitoring mechanisms
• Research enhancements to OpenSSI's P2P communications
• The use of very large page sizes (superpages) for large address spaces (see the sketch after this slide)
• Determine the proper interconnect balance as it impacts the operating system (OS)
• Establish system-wide tools and process management for a 100,000 processor environment
• OS noise (services that interrupt computation) effects
• Integrating a job scheduler with the OS
• Preemptive task migration
Peta-Scale SSI
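The superpage item above corresponds, on today's Linux, to hugetlbfs-backed mappings; the MAP_HUGETLB flag shown below postdates the 2004-era kernels these slides discuss, so treat this as a hedged sketch of what very large pages look like to an application, not the project's own mechanism.

```c
/* Hedged illustration only: requesting superpages via an anonymous
 * hugetlbfs-backed mapping on a modern Linux kernel. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 16UL << 20;   /* 16 MB, a multiple of the 2 MB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {     /* kernel support or reserved huge pages missing */
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* One TLB entry now covers 2 MB instead of 4 KB, cutting TLB misses
     * for large address spaces. */
    ((char *)p)[0] = 1;
    munmap(p, len);
    return 0;
}
```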
Reduce OS-Noise and increase cluster scalability via efficient compute nodes
(Architecture figure. Service-node stack: LVS, DLM, Lustre client, ICS, CLMS, Vproc, CFS/remote file block, cluster filesystem, process load leveling, IPC, devices; supporting install and sysadmin, boot and init, application monitoring and restart, MPI, and HA resource management and job scheduling. Compute-node stack: Lustre client, ICS, CLMS-Lite, Vproc, remote file block, boot, MPI.)
Service Nodes: single install; local boot (for HA); single IP (LVS); connection load balancing (LVS); single root with HA (Lustre); single file system namespace (Lustre); single IPC namespace; single process space and process load leveling; application HA; strong/strict membership.
Compute Nodes: single install; network or local boot; not part of single IP and no connection load balancing; single root with caching (Lustre); single file system namespace (Lustre); no single IPC namespace (optional); single process space but no process load leveling; no HA participation; scalable (relaxed) membership; inter-node communication channels on demand only.
Peta-Scale SSI
Researching the intersection of SSI and large kernels to get to 100,000+ processors
(Figure: one axis runs from 1 CPU to 2048 CPUs under a single Linux kernel; the other from 1 node to 10,000-node software SSI clusters. The stock Linux kernel and a typical SSI cluster mark the two starting points.)
Continue SGI's work on single-kernel scalability
Continue OpenSSI's work on SSI scalability
Test the intersection of large kernels with software OpenSSI to establish the sweet spot for 100,000 processor Linux environments
1) Establish scalability baselines
2) Enhance scalability of both approaches
3) Understand intersection of both methods
Peta-Scale SSI
Right-Weight Kernels
The right kernel, in the right place, at the right time
OS effect on Parallel Applications
• Simple problem: if all processors save one arrive at a join, then all wait for the laggard [Mraz SC '94]
– Mraz resolved the problem for AIX, interestingly, with purely local scheduling decisions (i.e., no global scheduler)
– Sandia resolved it by getting rid of the OS entirely (i.e., creation of the "Light-Weight Kernel")
• AIX has more capability than many apps need
• LWK has less capability than many apps want
RWK
Hence Right-Weight Kernels
• Customize the kernel to the app
• We're looking at two different approaches
• Customized, Modular Linux
– Based on 2.6
– With some scheduling enhancements
• "COTS" Secure LWK
– Based, after some searching, on Plan 9
– With some performance enhancements
RWK
Balancing Capability and Overhead
• We need to balance the capabilities that a full OS gives the user with the overhead of providing such services
• For a given app, we want to be as close to the “optimal” balance as possible
• But how do we measure what that is?
(Figure: a spectrum from full-featured OSes – AIX, Tru64, Solaris, Linux, etc. – to no OS at all; per-node capability increases toward the former while OS impact on the app decreases toward the latter, with Right-Weight Kernels sitting at points in between.)
RWK
Measuring what is “good”
• OS activity is periodic, thus we need to use techniques such as time series analysis to evaluate the measured data
– Use this data to figure out what is "good" and "bad"
• Caveat: you must practice good sampling hygiene [Sottile & Minnich, Cluster '04]
– Must follow the rules of statistical sampling
– Measuring work per unit of time leads to statistically sound data
– Measuring time per unit of work leads to meaningless data
(A minimal sketch of this fixed-time-quantum style of measurement follows this slide.)
RWK
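The "work per unit of time" rule is the fixed-time-quantum idea from the Sottile & Minnich paper cited above. Here is a minimal sketch of that style of sampling, assuming only gettimeofday; the 1 ms quantum and 1000 samples are arbitrary choices for the example.

```c
/* Minimal sketch of fixed-time-quantum (FTQ-style) sampling: count
 * work completed per fixed time quantum, so samples are evenly
 * spaced in time and standard time series analysis applies. */
#include <stdio.h>
#include <sys/time.h>

static double now_us(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void) {
    const double quantum_us = 1000.0;        /* 1 ms per sample */
    for (int s = 0; s < 1000; s++) {
        double end = now_us() + quantum_us;
        volatile unsigned long work = 0;
        while (now_us() < end)
            work++;                          /* one unit of work */
        /* Dips in 'work' are cycles the OS stole during this quantum. */
        printf("%d %lu\n", s, (unsigned long)work);
    }
    return 0;
}
```

Because the samples are evenly spaced in time, dips in the work counts feed directly into time series analysis; sampling time per unit of work instead would make the sample spacing depend on the very noise being measured.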
Conclusions
• Use sound statistical measurement techniques to figure out what is “good”
• Configure compute nodes on a per app basis (Right-Weight Kernel)
• Rinse and repeat!
• Collaborators
– Sung-Eun Choi, Matt Sottile, Erik Hendriks (LANL)
– Eric Grosse, Jim McKie, Vic Zandy (Bell Labs)
RWK
SFT: Scalable Fault Tolerant Runtime and Operating Systems
Pacific Northwest National Laboratory, Los Alamos National Laboratory,
University of Illinois, Quadrics
Team
• Jarek Nieplocha, PNNL
• Fabrizio Petrini and Kei Davis (LANL)
• Josep Torrellas and Yuanyuan Zhou (UIUC)
• David Addison (Quadrics)
• Industrial Partner: Stephen Wheat (Intel)
SFT
Motivation
• With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications.
• Application driver
– The multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems
– Applications place increasingly differing demands on system resources: disk, network, memory, and CPU
• Therefore, it will not be cost-effective or practical to rely on a single fault tolerance approach for all applications.
SFT
Goals
• Develop scalable and practical techniques for addressing fault tolerance at the operating system and runtime levels
– Design based on requirements of DoE applications
– Minimal impact on application performance
SFT
Petaflop Architecture
(Figure: tightly coupled nodes, each containing processors and memories, globally addressable but non-coherent between nodes, joined by an interconnection network.)
SFT
Scope
• We will investigate, develop, and evaluate a comprehensive range of techniques for fault tolerance.
– System-level incremental checkpointing approach (see the sketch after this slide)
• based on Buffered CoScheduling
• temporal and spatial hybrid checkpointing
• in-memory checkpointing and efficient handling of I/O
– Fault awareness in communication libraries
• while exploiting high-performance network communication
• MPI, ARMCI
• scalability
– Feasibility analysis of incremental checkpointing
SFT
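The checkpointing internals are not given in these slides, so as a hedged sketch of the page-level incremental ingredient only (none of the Buffered CoScheduling or temporal/spatial hybrid parts), the following C program write-protects a region after each checkpoint, catches the first write to each page as a fault, and saves only dirty pages at the next checkpoint.

```c
/* Hedged illustration only: page-level incremental checkpointing via
 * mprotect + SIGSEGV dirty-page tracking; not SFT's actual design. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 64
static char *region;
static long pagesz;
static int dirty[NPAGES];

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
    dirty[(page - region) / pagesz] = 1;             /* remember the page */
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);  /* let the write retry */
}

static void checkpoint(void) {
    int n = 0;
    for (int i = 0; i < NPAGES; i++)
        if (dirty[i]) { n++; dirty[i] = 0; }         /* real code: save page i */
    mprotect(region, NPAGES * pagesz, PROT_READ);    /* re-arm write tracking */
    printf("checkpoint: %d of %d pages dirty\n", n, NPAGES);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    pagesz = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(region, NPAGES * pagesz, PROT_READ);    /* start tracking */

    region[0] = 1;                                   /* touches page 0 only */
    region[5 * pagesz] = 1;                          /* and page 5 */
    checkpoint();                                    /* reports 2 dirty pages */
    return 0;
}
```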
Buffered CoScheduling
SFT
SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms
Lawrence Rauchwerger
http://parasol.tamu.edu/~rwerger/
Parasol Lab, Dept of Computer Science, Texas A&M
Today: System-Centric Computing
• Compilers are conservative
• OS offers generic services
• Architecture is generic
No Global Optimization:
• No matching between Application/OS/HW
• Intractable for the general case
WHAT'S MISSING?
Classic avenues to performance:
• Parallel Algorithms
• Static Compiler Optimization
• OS support
• Good Architecture
(Figure: system-centric computing – the application (algorithm) and input data feed development, analysis & optimization by a static compiler, followed by execution on the system (OS & architecture).)
SmartApps
Our Approach: SmartApps – Application-Centric Computing
(Figure: application-centric computing – the application (algorithm) is developed, analyzed, and optimized by a static compiler augmented with run-time techniques; a run-time system then performs execution, analysis, and optimization over the input data, a run-time compiler, a modular OS, and a reconfigurable architecture.)
SmartApp = Compiler + OS + Architecture + Data + Feedback
Application control: instance-specific optimization
SmartApps
SmartApps Architecture
(Flowchart: a STAPL application, through its development and advanced stages, is compiled by the static STAPL compiler augmented with runtime techniques, producing compiled code plus runtime hooks. At run time the smart application gets runtime information (sample input, system information, etc.), computes an optimal application and RTS + OS configuration, executes, and continuously monitors performance, adapting as necessary: small adaptations (tuning) are runtime tuning without recompile; large adaptations (failure, phase change) recompute the application and/or reconfigure the RTS + OS. Predictor & optimizer, predictor & evaluator, configurer, toolbox, and database components sit atop an adaptive RTS + OS.)
SmartApps
SmartApps written in STAPL
• STAPL (Standard Template Adaptive Parallel Library):
– Collection of generic parallel algorithms, distributed containers & run-time system (RTS)
– Inter-operable with sequential programs
– Extensible, composable by end-user
– Shared object view: no explicit communication
– Distributed objects: no replication/coherence
– High productivity environment
SmartApps
The STAPL Programming Environment
(Layered figure: user code sits on pAlgorithms, pContainers, and pRange; beneath them the RTS + communication library (ARMI) runs over OpenMP/MPI/pthreads/native, with an interface to the OS (K42).)
SmartApps
SmartApps to RTS to OS: Specialized Services from Generic OS Services
– The OS offers one-size-fits-all services
– IBM K42 offers customizable services
– We want customized services BUT we do not want to write them
• Interface between SmartApps (RTS) & OS (K42)
• Vertical integration of scheduling/memory management
SmartApps
Collaborative Effort:
• STAPL (Amato/Rauchwerger)
• STAPL Compiler (Rauchwerger/Stroustrup/Quinlan)
• RTS – K42 Interface & Optimizations (Krieger/Rauchwerger)
• Applications (Amato/Adams/others)
• Validation on DOE extreme HW: BlueGene (Moreira), possibly PERCS (Krieger/Sarkar)
Texas A&M (Parasol, NE) + IBM + LLNL
SmartApps
ZeptoOS: Studying Petascale Operating Systems with Linux
Argonne National Laboratory
Pete Beckman
Bill Gropp
Rusty Lusk
Susan Coghlan
Suravee Suthikulpanit
University of Oregon
Al Malony
Sameer Shende
Observations:
• Extremely large systems run an "OS Suite"
– BG/L and Red Storm both have at least 4 different operating system flavors
• The functional-decomposition trend lends itself to a customized, optimized point-solution OS
• Hierarchical organization requires software to manage topology, call forwarding, and collective operations
ZeptoOS
ZeptoOS
• Investigating 4 key areas:
– Linux as an ultra-lightweight kernel
• memory mgmt, scheduling efficiency, network
– Collective OS calls
• explicit collective behavior may be key (DLLs?)
– OS Performance monitoring for hierarchical systems
– Fault tolerance
ZeptoOS
Linux as a Lightweight Kernel: What does an OS steal from a selfish CPU application?
• Purpose: micro-benchmark measuring CPU cycles provided to the benchmark application (a minimal sketch follows this slide)
• Helps understand the "MPI-reduce problem" and gang scheduling issues
ZeptoOS
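The benchmark itself is not listed in these slides; here is a minimal sketch of the idea, assuming clock_gettime: a CPU-bound ("selfish") loop timestamps itself continuously, and any gap much larger than the calibrated loop cost is time the OS took. The 10x threshold and the 1-second run are arbitrary choices.

```c
/* Minimal sketch in the spirit of a "selfish" CPU benchmark: detect
 * OS detours as unusually large gaps between consecutive timestamps. */
#include <stdio.h>
#include <time.h>

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    /* Calibrate the cost of one loop iteration. */
    long long t0 = now_ns();
    for (int i = 0; i < 100000; i++)
        now_ns();
    long long per_iter = (now_ns() - t0) / 100000;

    long long stolen = 0, start = now_ns(), prev = start;
    while (now_ns() - start < 1000000000LL) {     /* run for 1 second */
        long long t = now_ns();
        if (t - prev > 10 * per_iter)             /* detour: OS took the CPU */
            stolen += t - prev;
        prev = t;
    }
    printf("OS stole ~%lld ns of 1 s (%.3f%%)\n",
           stolen, 100.0 * stolen / 1e9);
    return 0;
}
```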
Collective OS Calls
• Collective message-passing calls have been implemented very efficiently on many architectures
• Collective I/O calls permit scalable, efficient (non-POSIX) file I/O
• Collective OS calls, such as dynamically loading libraries, may provide scalable OS functionality (see the sketch after this slide)
ZeptoOS
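The slides leave "collective OS calls" abstract, so here is one hedged sketch of the dynamic-library example: rank 0 reads the shared object once and broadcasts it over MPI, and every node dlopens a node-local copy, turning thousands of filesystem reads into one read plus a scalable broadcast. The paths are invented and error handling is elided for brevity.

```c
/* Hedged illustration only: a "collective dlopen" built from MPI_Bcast;
 * not ZeptoOS's mechanism. Paths are hypothetical. */
#include <mpi.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long size = 0;
    char *image = NULL;
    if (rank == 0) {                       /* only rank 0 touches the FS */
        FILE *f = fopen("/shared/libphysics.so", "rb");  /* hypothetical */
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        rewind(f);
        image = malloc(size);
        fread(image, 1, size, f);
        fclose(f);
    }
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        image = malloc(size);
    MPI_Bcast(image, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Land the image somewhere node-local (e.g. a ramdisk) and load it. */
    char path[64];
    snprintf(path, sizeof path, "/tmp/libphysics.%d.so", rank);
    FILE *f = fopen(path, "wb");
    fwrite(image, 1, size, f);
    fclose(f);
    void *h = dlopen(path, RTLD_NOW);
    if (!h)
        fprintf(stderr, "dlopen: %s\n", dlerror());

    MPI_Finalize();
    return 0;
}
```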
Scalable OS Performance Monitoring(U of Oregon)
• TAU provides a framework for scalable performance analysis
• Integration of TAU into hierarchical systems, such as BG/L, will allow us to explore:
– instrumentation of light-weight kernels (call forwarding, memory, etc.)
– intermediate, parallel aggregation of performance data at I/O nodes
– integration of data from the OS Suite
ZeptoOS
Exploring Faults: Faulty Towers
• Modify Linux so we can selectively and predictably break things
• Run user code, middleware, etc. at ultra scale, with faults
• Explore metrics for codes with good "survivability"
(Figure: fault-injection targets – memory, kernel, MPI/net, disk, middleware.)
It’s not a bug, it’s a feature!
Dial-a-Disaster
ZeptoOS
Simple Counts
• OSes (4): Linux (6.5), K42 (2), Custom (1), Plan 9 (0.5)
• Labs (7): ANL, LANL, ORNL, LBNL, LLNL, PNNL, SNL
• Universities: Caltech, Louisiana Tech, NCSU, Rice, Ohio State, Texas A&M, Toronto, UIUC, UTEP, UNM, U of Chicago, U of Oregon, U of Wisconsin
• Industry: Bell Labs, Cray, HP, IBM, Intel, CFS (Lustre), Quadrics, SGI
Apple Pie
• Open source
• Partnerships: labs, universities, and industry
• Scope: basic research, applied research, development, prototypes, testbed systems, and deployment
• Structure: "don't choose a winner too early"
– Current or near-term problems: commonly used, open-source OSes (e.g., Linux or FreeBSD)
– Prototyping work in K42 and Plan 9
– At least one wacko project (explore novel ideas that don't fit into an existing framework)
A bit more interesting
• Virtualization
– Colony
• Adaptability
– DAiSES, K42, MOLAR, SmartApps
– Config, RWK
• Usage model & system mgmt (OS Suites)
– Colony, Config, MOLAR, Peta-scale SSI, Zepto
• Metrics & Measurement
– HPC Challenge (http://icl.cs.utk.edu/hpcc/)
– DAiSES, K42, MOLAR, RWK, Zepto
• Fault handling
– Colony, MOLAR, Scalable FT, Zepto
continued
• Managing the memory hierarchy
• Security
• Common API
– K42, Linux
• Single System Image– Peta-scale SSI
• Collective Runtime– Zepto
• I/O– Peta-scale SSI
• OS Noise– Colony, Peta-scale SSI, RWK, Zepto
Application Driven
• Meet the application developers
– OS presentations
– Apps people panic: what are you doing to my machine?
– OS people tell 'em what we heard
– Apps people tell us what we didn't understand