TRANSCRIPT
FAST-OS BOF SC 04
http://www.cs.unm.edu/~fastos
Follow the link to subscribe to the mailing list
Projects
• Colony – Terry Jones, LLNL
• Config Framework – Ron Brightwell, SNL
• DAiSES – Pat Teller, UTEP
• K42 – Paul Hargrove, LBNL
• MOLAR – Stephen Scott, ORNL
• Peta-Scale SSI – Scott Studham, ORNL
• Rightweight Kernels – Ron Minnich, LANL
• Scalable FT – Jarek Nieplocha, PNNL
• SmartApps – L. Rauchwerger, Texas A&M
• ZeptoOS – Pete Beckman, ANL
www.HPC-Colony.org
Services & Interfaces For Very Large Linux Clusters
Terry Jones, LLNL, Coordinating PI; Laxmikant Kale, UIUC, PI;
Jose Moreira, IBM, PI; Celso Mendes, UIUC; Derek Lieber, IBM
Overview
Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors
Collaborators:
• Lawrence Livermore National Laboratory
• University of Illinois at Urbana-Champaign
• International Business Machines
Topics:
• Parallel Resource Instrumentation Framework
• Scalable Load Balancing
• OS mechanisms for Migration
• Processor Virtualization for Fault Tolerance
• Single system management space
• Parallel Awareness and Coordinated Scheduling of Services
• Linux OS for cellular architecture
Colony
Motivation
• Parallel resource management
Strategies for scheduling and load balancing must be improved. Difficulties in achieving a balanced partitioning and dynamically scheduling workloads can limit scaling for complex problems on large machines.
• Global system management
System management is inadequate. Parallel jobs require common operating system services, such as process scheduling, event notification, and job management to scale to large machines.
Colony
Goals
• Develop infrastructure and strategies for automated parallel resource management
– Today, application programmers must explicitly manage these resources. We address scaling issues and porting issues by delegating resource management tasks to a sophisticated parallel OS.
– “Managing Resources” includes balancing CPU time, network utilization, and memory usage across the entire machine.
• Develop a set of services to enhance the OS to improve its ability to support systems with very large numbers of processors
– We will improve operating system awareness of the requirements of parallel applications.
– We will enhance operating system support for parallel execution by providing coordinated scheduling and improved management services for very large machines.
Colony
Approach
• Top Down
– Our work will start from an existing full-featured OS and remove excess baggage with a "top down" approach.
• Processor Virtualization
– One of our core techniques: the programmer divides the computation into a large number of entities, which are mapped to the available processors by an intelligent runtime system (a minimal sketch follows this slide).
• Leverage Advantages of Full-Featured OS & Single System Image
– Applications on these extreme-scale systems will benefit from extensive services and interfaces; managing these complex systems will require an improved "logical view"
• Utilize Blue Gene
– Suitable platform for ideas intended for very large numbers of processors
Colony
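The slides name processor virtualization but do not show the runtime itself (Colony builds on Charm++/Adaptive MPI machinery). Below is a minimal, hedged C sketch of the over-decomposition idea only: far more work entities than processors, with a trivial "runtime" (a shared queue drained by pthreads workers) doing the mapping. All names (chunk_t, NCHUNKS, worker) are invented for illustration.

```c
/* Hedged illustration only: over-decomposition in the processor
 * virtualization style; not Colony's actual runtime. */
#include <pthread.h>
#include <stdio.h>

#define NCHUNKS 1024            /* many more entities than processors */
#define NWORKERS 4              /* stand-in for the physical processors */

typedef struct { int id; double result; } chunk_t;

static chunk_t chunks[NCHUNKS];
static int next_chunk = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void compute(chunk_t *c) {
    c->result = c->id * 0.5;    /* stand-in for real, variable-cost work */
}

/* Workers pull the next unassigned entity, so the mapping of entities
 * to processors is the runtime's job, not the programmer's. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int i = (next_chunk < NCHUNKS) ? next_chunk++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0)
            return NULL;
        compute(&chunks[i]);
    }
}

int main(void) {
    pthread_t t[NWORKERS];
    for (int i = 0; i < NCHUNKS; i++)
        chunks[i].id = i;
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    printf("computed %d chunks on %d workers\n", NCHUNKS, NWORKERS);
    return 0;
}
```

Because workers pull entities on demand, a slow or interrupted processor simply completes fewer chunks; the balance emerges from the mapping, which is the property an intelligent runtime exploits at much larger scale.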
Configurable OS Framework
• Sandia (lead) – Ron Brightwell (PI), Rolf Riesen
• Caltech – Thomas Sterling (PI)
• UNM – Barney Maccabe (PI), Patrick Bridges
Issues
• Novel architectures – lots of execution environments
• Programming models – MPI, UPC, separating processing from location
• Shared services – file systems, shared WAN
• Usage model – dedicated, space shared, time shared
Approach
• Build application-specific OS
– Architecture, programming model, shared resources, usage model
• Develop a collection of micro services
– Compose and distribute
• Compose services
– Services may adapt
• Kinds of services (see the sketch below)
– Memory allocation, signal delivery, message receipt and handler activation
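The slides leave the micro-service interfaces abstract, so the following is only a hedged C sketch of the composition idea: each micro service is an entry in a table, and an application-specific OS image is just the particular table you choose to assemble. Every name here (micro_service, os_image, the two services) is invented.

```c
/* Hedged illustration only: composing an application-specific OS image
 * from a table of micro services; not the project's real framework. */
#include <stdio.h>

typedef struct micro_service {
    const char *name;
    int  (*init)(void);            /* bring the service up */
    void (*handle_event)(int ev);  /* e.g. signal delivery, message receipt */
} micro_service;

static int  mem_init(void)     { puts("mem: simple region allocator"); return 0; }
static void mem_event(int ev)  { (void)ev; }
static int  msg_init(void)     { puts("msg: receipt + handler activation"); return 0; }
static void msg_event(int ev)  { printf("msg: activate handler for event %d\n", ev); }

/* The "composition" step: pick exactly the services this application,
 * programming model, and usage model need, and nothing else. */
static micro_service os_image[] = {
    { "memory-allocation", mem_init, mem_event },
    { "message-receipt",   msg_init, msg_event },
};

int main(void) {
    for (unsigned i = 0; i < sizeof os_image / sizeof os_image[0]; i++)
        if (os_image[i].init() != 0)
            return 1;
    os_image[1].handle_event(42);   /* deliver one event through the stack */
    return 0;
}
```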
The Picture (architecture figure)
Challenges
• How to reason about combinations
• Dependencies among services
• Efficiency
– Overhead associated with transfers between micro services
• How many operating systems will we really need?
(Figure: generalized → customized resource management; fixed → dynamically adaptable OS/runtime services; yielding enhanced performance.)
Goals – Dynamic Adaptability in Support of Extreme Scale
Determining
• What to adapt
• When to adapt
• How to adapt
• How to measure effects of adaptation
Challenges – Dynamic Adaptability in Support of Extreme Scale
• Develop mechanisms to dynamically sense, analyze, and adjust common performance metrics, fluctuating workload situations, and overall system environment conditions
• Demonstrate, via Linux prototypes and experiments, dynamic self-tuning/provisioning in HPC environments
• Develop a methodology for general-purpose OS adaptation
Deliverables – Dynamic Adaptability in Support of Extreme Scale
Methodology
(Workflow figure over the potential adaptation targets, spanning off-line and off-line/run-time phases:)
• identify adaptation targets
• characterize workload resource usage patterns
• (re)determine adaptation intervals
• define/adapt heuristics to trigger adaptation
• generate/adapt monitoring, triggering, and adaptation code, and attach it to the OS
• KernInst: monitor application execution, triggering adaptation as necessary
Dynamic Adaptability in Support of Extreme Scale
KernInst
(Architecture figure: an instrumentation tool client uses the KernInst API to reach the KernInst daemon, which drives the KernInst device inside the Linux kernel, here on an IBM pSeries eServer 690.)
• University of Wisconsin's KernInst and Kperfmon provide dynamic instrumentation of the kernel: the capability to perform dynamic monitoring and adaptation of commodity operating systems, making run-time monitoring and adaptation far more tractable.
Dynamic Adaptability in Support of Extreme Scale
Customization of:
• process scheduling parameters and algorithms, e.g., scheduling policy for different job types (prototype in process)
• file system cache size and management
• disk cache management
• size of OS buffers and tables
• I/O, e.g., checkpoint/restart
• memory allocation and management parameters and algorithms
Example Adaptations – Dynamic Adaptability in Support of Extreme Scale
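None of the DAiSES/KernInst machinery is spelled out in these slides, so as a rough illustration of the sense-analyze-adjust loop behind such adaptations, here is a hedged user-space C sketch that tunes a single Linux knob (vm.swappiness via /proc) on a fixed interval. KernInst would instead splice the monitoring and adaptation code into the running kernel; the 10% pressure threshold and the 80/40 values below are arbitrary.

```c
/* Hedged illustration only: a user-space monitor-trigger-adapt loop
 * over one Linux tunable; not the DAiSES/KernInst mechanism itself. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long meminfo_kb(const char *key) {           /* e.g. "MemFree:" */
    char line[128];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

static void set_swappiness(int v) {                 /* needs root */
    FILE *f = fopen("/proc/sys/vm/swappiness", "w");
    if (f) { fprintf(f, "%d\n", v); fclose(f); }
}

int main(void) {
    for (;;) {
        long total = meminfo_kb("MemTotal:");
        long freek = meminfo_kb("MemFree:");
        if (total > 0 && freek >= 0) {
            /* trigger heuristic (arbitrary numbers): under memory
             * pressure, reclaim more aggressively */
            set_swappiness(freek * 100 / total < 10 ? 80 : 40);
        }
        sleep(5);   /* the (re)determined adaptation interval */
    }
}
```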
University of Texas at El Paso, Department of Computer Science – Patricia J. Teller ([email protected])
University of Wisconsin–Madison, Computer Sciences Department – Barton P. Miller ([email protected])
International Business Machines, Inc., Linux Technology Center – Bill Buros ([email protected])
U.S. Department of Energy, Office of Science – Fred Johnson ([email protected])
Partners – Dynamic Adaptability in Support of Extreme Scale
High End Computing with K42
Computational Research Division
Paul H. Hargrove and Katherine Yelick, Lawrence Berkeley National Lab
Angela Demke Brown and Michael Stumm, University of Toronto
Patrick Bridges, University of New Mexico
Orran Krieger and Dilma Da Silva, IBM
Project Motivation
• The HECRTF and FastOS reports enumerate unmet needs in the area of Operating Systems for HEC, including:
– Availability of Research Frameworks
– Support for Architectural Innovation
– Performance Visibility
– Ease of Use
– Adaptability to Application Requirements
• This project uses the K42 Operating System to address these five needs
K42
K42 Background
• K42 is a research OS from IBM
– API/ABI compatibility with Linux
– Designed for large 64-bit SMPs
– Extensible object-oriented design
• features per-resource-instance objects
• can change implementation/policy for individual instances at runtime (a toy sketch follows this slide)
– Extensive performance monitoring
– Many traditional OS functions are performed in user-space libraries
K42
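K42's hot-swappable objects are C++ classes with a careful swap protocol; the slides only name the capability. As a toy illustration of per-instance implementation replacement, here is a hedged C sketch that reduces the idea to a swappable function-pointer table. All names (file_ops, read_cached, read_streaming) are invented.

```c
/* Hedged illustration only: the flavor of K42's per-instance,
 * runtime-replaceable objects, reduced to a function-pointer table.
 * K42's real generation-based swap protocol is not shown. */
#include <stdio.h>

typedef struct file_ops { long (*read)(void *self, long n); } file_ops;

/* Two interchangeable implementations of the same resource object. */
static long read_cached(void *self, long n)    { (void)self; puts("cached path");    return n; }
static long read_streaming(void *self, long n) { (void)self; puts("streaming path"); return n; }

static file_ops cached_ops    = { read_cached };
static file_ops streaming_ops = { read_streaming };

typedef struct file_obj { file_ops *ops; /* per-instance, swappable */ } file_obj;

int main(void) {
    file_obj f = { &cached_ops };
    f.ops->read(&f, 4096);
    /* Runtime replacement for THIS instance only: e.g. a monitor
     * notices a streaming access pattern and swaps the policy. */
    f.ops = &streaming_ops;
    f.ops->read(&f, 4096);
    return 0;
}
```

The point mirrored from K42: the swap changes policy for one instance only, so a monitor can retune a single hot file or socket without touching the rest of the system.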
What Work Remains? (1 of 2)
• Availability of Research Frameworks & Support for Architectural Innovation
– K42 is already a research platform, used by IBM for its PERCS project (DARPA HPCS) to support architectural innovation
– Work remains to expand K42 from SMPs to clusters
• Performance Visibility
– Existing facilities are quite extensive
– Work remains to use runtime replacement of object implementations to monitor single objects for fine-grained control
K42
What Work Remains? (2 of 2)
• Ease of Use
– Work remains to make K42 widely available, and to bring HEC user environments to K42 (e.g., MPI, batch systems, etc.)
• Adaptability to Application Requirements
– Runtime replacement of object implementations provides extreme customizability
– Work remains to provide implementations appropriate to HEC, and to perform automatic dynamic adaptation
K42
MOLAR: Modular Linux and Adaptive Runtime Support for High-end Computing Operating and Runtime Systems
Coordinating Principal Investigator
Stephen L. Scott, ORNL
Principal Investigators
J. Vetter, D.E. Bernholdt, C. Engelmann – ORNL
C. Leangsuksun – Louisiana Tech University
P. Sadayappan – Ohio State University
F. Mueller – North Carolina State University
Collaborators
A.B. Maccabe – University of New Mexico
C. Nuss, D. Mason – Cray Inc.
Oak Ridge National Laboratory – U.S. Department of Energy
MOLAR research goals
• Create a modular and configurable Linux system that allows customized changes based on the requirements of the applications, runtime systems, and cluster management software.
• Build runtime systems that leverage the OS modularity and configurability to improve efficiency, reliability, scalability, ease-of-use, and provide support to legacy and promising programming models.
• Advance computer reliability, availability and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues.
• Explore the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions.
MOLAR
High-end Computing OS Research Map – MOLAR: Modular Linux and Adaptive Runtime support
(Research map figure: an HEC Linux OS – modular, custom, light-weight – surrounded by RAS (high availability, monitoring, root cause analysis), kernel design, performance observation, communications and I/O, and testbeds; the map distinguishes provided components from runtime/OS pieces to extend or adapt.)
PROBLEM:
• Current OSs and runtime systems (OS/R) are unable to meet the various requirements to run large applications efficiently on future ultra-scale computers.
GOALS:
• Development of a modular and configurable Linux framework.
• Runtime systems to provide seamless coordination between system levels.
• Monitoring and adaptation of the operating system, runtime, and applications.
• Reliability, availability, and serviceability (RAS).
• Efficient system management tools.
IMPACT:
• Enhanced support and better understanding of extremely scalable architectures.
• Proof-of-concept implementation open to community researchers.
MOLAR
MOLAR crosscut capability deployed for RAS
• Monitoring core daemon
– service monitor
– resource monitor
– hardware health monitor
• Head nodes: active / hot standby
• Services: active / hot standby
• Modular Linux systems deployment & development
MOLAR
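The slides do not define the monitoring protocol behind the active/hot-standby head nodes, so the following is only a hedged C sketch of that generic pattern: the standby listens for UDP heartbeats from the active head node and promotes itself after several missed beats. The port number, interval, and threshold are all invented.

```c
/* Hedged illustration only: a generic active/hot-standby heartbeat
 * monitor; not MOLAR's monitoring core daemon. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/select.h>

int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_ANY);
    a.sin_port = htons(7700);               /* heartbeat port (made up) */
    bind(s, (struct sockaddr *)&a, sizeof a);

    int missed = 0;
    while (missed < 3) {                    /* three missed beats = failover */
        fd_set r;
        FD_ZERO(&r);
        FD_SET(s, &r);
        struct timeval tv = { 2, 0 };       /* beats expected every 2 s */
        if (select(s + 1, &r, NULL, NULL, &tv) > 0) {
            char buf[64];
            recv(s, buf, sizeof buf, 0);
            missed = 0;                     /* active head node is alive */
        } else {
            missed++;
        }
    }
    puts("active head node silent: standby promotes itself");
    /* real system: acquire the service IP, replay state, restart services */
    return 0;
}
```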
MOLAR Federated System Management (fSM)
• fSM emphasizes simplicity
– self-build
– self-configuration
– self-healing
– simplified operation
• Expand MOLAR support:
– investigate specialized architectures
– investigate other environments & OSs
• Head nodes: active / active
• Services: active / active
MOLAR
Peta-Scale Single-System Image: A framework for a single-system image Linux environment for 100,000+ processors and multiple architectures
Coordinating Investigator
R. Scott Studham, ORNL
Principal Investigators
Alan Cox, Rice University
Bruce Walker, HP
Investigators
Peter Druschel, Rice University
Scott Rixner, Rice University
Collaborators
Peter Braam, CFS
Steve Reinhardt, SGI
Stephen Wheat, Intel
Project Key Objectives
• OpenSSI to 10,000 nodes
• Integration of OpenSSI with nodes with high processor counts
• The scalability of a shared root filesystem to 10,000 nodes
• Scalable booting and monitoring mechanisms
• Research enhancements to OpenSSI's P2P communications
• The use of very large page sizes (superpages) for large address spaces (see the sketch after this slide)
• Determine the proper interconnect balance as it impacts the operating system (OS)
• Establish system-wide tools and process management for a 100,000 processor environment
• OS noise (services that interrupt computation) effects
• Integrating a job scheduler with the OS
• Preemptive task migration
Peta-Scale SSI
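The superpage item above corresponds, on today's Linux, to hugetlbfs-backed mappings; the MAP_HUGETLB flag shown below postdates the 2004-era kernels these slides discuss, so treat this as a hedged sketch of what very large pages look like to an application, not the project's own mechanism.

```c
/* Hedged illustration only: requesting superpages via an anonymous
 * hugetlbfs-backed mapping on a modern Linux kernel. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 16UL << 20;   /* 16 MB, a multiple of the 2 MB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {     /* kernel support or reserved huge pages missing */
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* One TLB entry now covers 2 MB instead of 4 KB, cutting TLB misses
     * for large address spaces. */
    ((char *)p)[0] = 1;
    munmap(p, len);
    return 0;
}
```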
Reduce OS-Noise and increase cluster scalability via efficient compute nodes
(Architecture figure. Service-node stack: LVS, DLM, Lustre client, ICS, CLMS, Vproc, CFS/remote file block, cluster filesystem, process load leveling, IPC, devices; supporting install and sysadmin, boot and init, application monitoring and restart, MPI, and HA resource management and job scheduling. Compute-node stack: Lustre client, ICS, CLMS-Lite, Vproc, remote file block, boot, MPI.)
Service Nodes: single install; local boot (for HA); single IP (LVS); connection load balancing (LVS); single root with HA (Lustre); single file system namespace (Lustre); single IPC namespace; single process space and process load leveling; application HA; strong/strict membership.
Compute Nodes: single install; network or local boot; not part of single IP and no connection load balancing; single root with caching (Lustre); single file system namespace (Lustre); no single IPC namespace (optional); single process space but no process load leveling; no HA participation; scalable (relaxed) membership; inter-node communication channels on demand only.
Peta-Scale SSI
Researching the intersection of SSI and large kernels to get to 100,000+ processors
(Figure: one axis runs from 1 CPU to 2048 CPUs under a single Linux kernel; the other from 1 node to 10,000-node software SSI clusters. The stock Linux kernel and a typical SSI cluster mark the two starting points.)
Continue SGI's work on single-kernel scalability
Continue OpenSSI's work on SSI scalability
Test the intersection of large kernels with software OpenSSI to establish the sweet spot for 100,000 processor Linux environments
1) Establish scalability baselines
2) Enhance scalability of both approaches
3) Understand intersection of both methods
Peta-Scale SSI
Right-Weight Kernels
The right kernel, in the right place, at the right time
OS effect on Parallel Applications
• Simple problem: if all processors save one arrive at a join, then all wait for the laggard [Mraz SC '94]
– Mraz resolved the problem for AIX, interestingly, with purely local scheduling decisions (i.e., no global scheduler)
– Sandia resolved it by getting rid of the OS entirely (i.e., creation of the "Light-Weight Kernel")
• AIX has more capability than many apps need
• LWK has less capability than many apps want
RWK
Hence Right-Weight Kernels
• Customize the kernel to the app
• We're looking at two different approaches
• Customized, Modular Linux
– Based on 2.6
– With some scheduling enhancements
• "COTS" Secure LWK
– Based, after some searching, on Plan 9
– With some performance enhancements
RWK
Balancing Capability and Overhead
• We need to balance the capabilities that a full OS gives the user with the overhead of providing such services
• For a given app, we want to be as close to the “optimal” balance as possible
• But how do we measure what that is?
(Figure: a spectrum from full-featured OSes – AIX, Tru64, Solaris, Linux, etc. – to no OS at all; per-node capability increases toward the former while OS impact on the app decreases toward the latter, with Right-Weight Kernels sitting at points in between.)
RWK
Measuring what is “good”
• OS activity is periodic, thus we need to use techniques such as time series analysis to evaluate the measured data
– Use this data to figure out what is "good" and "bad"
• Caveat: you must practice good sampling hygiene [Sottile & Minnich, Cluster '04]
– Must follow the rules of statistical sampling
– Measuring work per unit of time leads to statistically sound data
– Measuring time per unit of work leads to meaningless data
(A minimal sketch of this fixed-time-quantum style of measurement follows this slide.)
RWK
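The "work per unit of time" rule is the fixed-time-quantum idea from the Sottile & Minnich paper cited above. Here is a minimal sketch of that style of sampling, assuming only gettimeofday; the 1 ms quantum and 1000 samples are arbitrary choices for the example.

```c
/* Minimal sketch of fixed-time-quantum (FTQ-style) sampling: count
 * work completed per fixed time quantum, so samples are evenly
 * spaced in time and standard time series analysis applies. */
#include <stdio.h>
#include <sys/time.h>

static double now_us(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void) {
    const double quantum_us = 1000.0;        /* 1 ms per sample */
    for (int s = 0; s < 1000; s++) {
        double end = now_us() + quantum_us;
        volatile unsigned long work = 0;
        while (now_us() < end)
            work++;                          /* one unit of work */
        /* Dips in 'work' are cycles the OS stole during this quantum. */
        printf("%d %lu\n", s, (unsigned long)work);
    }
    return 0;
}
```

Because the samples are evenly spaced in time, dips in the work counts feed directly into time series analysis; sampling time per unit of work instead would make the sample spacing depend on the very noise being measured.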
Conclusions
• Use sound statistical measurement techniques to figure out what is “good”
• Configure compute nodes on a per app basis (Right-Weight Kernel)
• Rinse and repeat!
• Collaborators
– Sung-Eun Choi, Matt Sottile, Erik Hendriks (LANL)
– Eric Grosse, Jim McKie, Vic Zandy (Bell Labs)
RWK
SFT: Scalable Fault Tolerant Runtime and Operating Systems
Pacific Northwest National Laboratory, Los Alamos National Laboratory,
University of Illinois, Quadrics
Team
• Jarek Nieplocha, PNNL
• Fabrizio Petrini and Kei Davis (LANL)
• Josep Torrellas and Yuanyuan Zhou (UIUC)
• David Addison (Quadrics)
• Industrial Partner: Stephen Wheat (Intel)
SFT
Motivation
• With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications.
• Application driver
– The multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems
– Applications place increasingly differing demands on system resources: disk, network, memory, and CPU
• Therefore, it will not be cost-effective or practical to rely on a single fault tolerance approach for all applications.
SFT
Goals
• Develop scalable and practical techniques for addressing fault tolerance at the operating system and runtime levels
– Design based on requirements of DoE applications
– Minimal impact on application performance
SFT
Petaflop Architecture
(Figure: tightly coupled nodes, each containing processors and memories, globally addressable but non-coherent between nodes, joined by an interconnection network.)
SFT
Scope
• We will investigate, develop, and evaluate a comprehensive range of techniques for fault tolerance.
– System-level incremental checkpointing approach (see the sketch after this slide)
• based on Buffered CoScheduling
• temporal and spatial hybrid checkpointing
• in-memory checkpointing and efficient handling of I/O
– Fault awareness in communication libraries
• while exploiting high-performance network communication
• MPI, ARMCI
• scalability
– Feasibility analysis of incremental checkpointing
SFT
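The checkpointing internals are not given in these slides, so as a hedged sketch of the page-level incremental ingredient only (none of the Buffered CoScheduling or temporal/spatial hybrid parts), the following C program write-protects a region after each checkpoint, catches the first write to each page as a fault, and saves only dirty pages at the next checkpoint.

```c
/* Hedged illustration only: page-level incremental checkpointing via
 * mprotect + SIGSEGV dirty-page tracking; not SFT's actual design. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 64
static char *region;
static long pagesz;
static int dirty[NPAGES];

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
    dirty[(page - region) / pagesz] = 1;             /* remember the page */
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);  /* let the write retry */
}

static void checkpoint(void) {
    int n = 0;
    for (int i = 0; i < NPAGES; i++)
        if (dirty[i]) { n++; dirty[i] = 0; }         /* real code: save page i */
    mprotect(region, NPAGES * pagesz, PROT_READ);    /* re-arm write tracking */
    printf("checkpoint: %d of %d pages dirty\n", n, NPAGES);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    pagesz = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(region, NPAGES * pagesz, PROT_READ);    /* start tracking */

    region[0] = 1;                                   /* touches page 0 only */
    region[5 * pagesz] = 1;                          /* and page 5 */
    checkpoint();                                    /* reports 2 dirty pages */
    return 0;
}
```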
Buffered CoScheduling
SFT
SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms
Lawrence Rauchwerger
http://parasol.tamu.edu/~rwerger/
Parasol Lab, Dept of Computer Science, Texas A&M
Today: System-Centric Computing
• Compilers are conservative
• OS offers generic services
• Architecture is generic
No Global Optimization:
• No matching between Application/OS/HW
• Intractable for the general case
WHAT'S MISSING?
Classic avenues to performance:
• Parallel Algorithms
• Static Compiler Optimization
• OS support
• Good Architecture
(Figure: system-centric computing – the application (algorithm) and input data feed development, analysis & optimization by a static compiler, followed by execution on the system (OS & architecture).)
SmartApps
Our Approach: SmartApps – Application-Centric Computing
(Figure: application-centric computing – the application (algorithm) is developed, analyzed, and optimized by a static compiler augmented with run-time techniques; a run-time system then performs execution, analysis, and optimization over the input data, a run-time compiler, a modular OS, and a reconfigurable architecture.)
SmartApp = Compiler + OS + Architecture + Data + Feedback
Application control: instance-specific optimization
SmartApps
SmartApps Architecture
(Flowchart: a STAPL application, through its development and advanced stages, is compiled by the static STAPL compiler augmented with runtime techniques, producing compiled code plus runtime hooks. At run time the smart application gets runtime information (sample input, system information, etc.), computes an optimal application and RTS + OS configuration, executes, and continuously monitors performance, adapting as necessary: small adaptations (tuning) are runtime tuning without recompile; large adaptations (failure, phase change) recompute the application and/or reconfigure the RTS + OS. Predictor & optimizer, predictor & evaluator, configurer, toolbox, and database components sit atop an adaptive RTS + OS.)
SmartApps
SmartApps written in STAPL
• STAPL (Standard Template Adaptive Parallel Library):
– Collection of generic parallel algorithms, distributed containers & run-time system (RTS)
– Inter-operable with sequential programs
– Extensible, composable by end-user
– Shared object view: no explicit communication
– Distributed objects: no replication/coherence
– High productivity environment
SmartApps
The STAPL Programming Environment
(Layered figure: user code sits on pAlgorithms, pContainers, and pRange; beneath them the RTS + communication library (ARMI) runs over OpenMP/MPI/pthreads/native, with an interface to the OS (K42).)
SmartApps
SmartApps to RTS to OS: Specialized Services from Generic OS Services
– The OS offers one-size-fits-all services
– IBM K42 offers customizable services
– We want customized services BUT we do not want to write them
• Interface between SmartApps (RTS) & OS (K42)
• Vertical integration of scheduling/memory management
SmartApps
Collaborative Effort:
• STAPL (Amato/Rauchwerger)
• STAPL Compiler (Rauchwerger/Stroustrup/Quinlan)
• RTS – K42 Interface & Optimizations (Krieger/Rauchwerger)
• Applications (Amato/Adams/others)
• Validation on DOE extreme HW: BlueGene (Moreira), possibly PERCS (Krieger/Sarkar)
Texas A&M (Parasol, NE) + IBM + LLNL
SmartApps
ZeptoOS: Studying Petascale Operating Systems with Linux
Argonne National Laboratory
Pete Beckman
Bill Gropp
Rusty Lusk
Susan Coghlan
Suravee Suthikulpanit
University of Oregon
Al Malony
Sameer Shende
Observations:
• Extremely large systems run an "OS Suite"
– BG/L and Red Storm both have at least 4 different operating system flavors
• The functional-decomposition trend lends itself to a customized, optimized point-solution OS
• Hierarchical organization requires software to manage topology, call forwarding, and collective operations
ZeptoOS
ZeptoOS
• Investigating 4 key areas:
– Linux as an ultra-lightweight kernel
• memory mgmt, scheduling efficiency, network
– Collective OS calls
• explicit collective behavior may be key (DLLs?)
– OS Performance monitoring for hierarchical systems
– Fault tolerance
ZeptoOS
Linux as a Lightweight Kernel: What does an OS steal from a selfish CPU application?
• Purpose: micro-benchmark measuring CPU cycles provided to the benchmark application (a minimal sketch follows this slide)
• Helps understand the "MPI-reduce problem" and gang scheduling issues
ZeptoOS
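The benchmark itself is not listed in these slides; here is a minimal sketch of the idea, assuming clock_gettime: a CPU-bound ("selfish") loop timestamps itself continuously, and any gap much larger than the calibrated loop cost is time the OS took. The 10x threshold and the 1-second run are arbitrary choices.

```c
/* Minimal sketch in the spirit of a "selfish" CPU benchmark: detect
 * OS detours as unusually large gaps between consecutive timestamps. */
#include <stdio.h>
#include <time.h>

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    /* Calibrate the cost of one loop iteration. */
    long long t0 = now_ns();
    for (int i = 0; i < 100000; i++)
        now_ns();
    long long per_iter = (now_ns() - t0) / 100000;

    long long stolen = 0, start = now_ns(), prev = start;
    while (now_ns() - start < 1000000000LL) {     /* run for 1 second */
        long long t = now_ns();
        if (t - prev > 10 * per_iter)             /* detour: OS took the CPU */
            stolen += t - prev;
        prev = t;
    }
    printf("OS stole ~%lld ns of 1 s (%.3f%%)\n",
           stolen, 100.0 * stolen / 1e9);
    return 0;
}
```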
Collective OS Calls
• Collective message-passing calls have been implemented very efficiently on many architectures
• Collective I/O calls permit scalable, efficient (non-POSIX) file I/O
• Collective OS calls, such as dynamically loading libraries, may provide scalable OS functionality (see the sketch after this slide)
ZeptoOS
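The slides leave "collective OS calls" abstract, so here is one hedged sketch of the dynamic-library example: rank 0 reads the shared object once and broadcasts it over MPI, and every node dlopens a node-local copy, turning thousands of filesystem reads into one read plus a scalable broadcast. The paths are invented and error handling is elided for brevity.

```c
/* Hedged illustration only: a "collective dlopen" built from MPI_Bcast;
 * not ZeptoOS's mechanism. Paths are hypothetical. */
#include <mpi.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long size = 0;
    char *image = NULL;
    if (rank == 0) {                       /* only rank 0 touches the FS */
        FILE *f = fopen("/shared/libphysics.so", "rb");  /* hypothetical */
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        rewind(f);
        image = malloc(size);
        fread(image, 1, size, f);
        fclose(f);
    }
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        image = malloc(size);
    MPI_Bcast(image, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Land the image somewhere node-local (e.g. a ramdisk) and load it. */
    char path[64];
    snprintf(path, sizeof path, "/tmp/libphysics.%d.so", rank);
    FILE *f = fopen(path, "wb");
    fwrite(image, 1, size, f);
    fclose(f);
    void *h = dlopen(path, RTLD_NOW);
    if (!h)
        fprintf(stderr, "dlopen: %s\n", dlerror());

    MPI_Finalize();
    return 0;
}
```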
Scalable OS Performance Monitoring(U of Oregon)
• TAU provides a framework for scalable performance analysis
• Integration of TAU into hierarchical systems, such as BG/L, will allow us to explore:
– instrumentation of light-weight kernels (call forwarding, memory, etc.)
– intermediate, parallel aggregation of performance data at I/O nodes
– integration of data from the OS Suite
ZeptoOS
Exploring Faults: Faulty Towers
• Modify Linux so we can selectively and predictably break things
• Run user code, middleware, etc. at ultra scale, with faults
• Explore metrics for codes with good "survivability"
(Figure: fault-injection targets – memory, kernel, MPI/net, disk, middleware.)
It’s not a bug, it’s a feature!
Dial-a-Disaster
ZeptoOS
Simple Counts
• OSes (4): Linux (6.5), K42 (2), Custom (1), Plan 9 (0.5)
• Labs (7): ANL, LANL, ORNL, LBNL, LLNL, PNNL, SNL
• Universities: Caltech, Louisiana Tech, NCSU, Rice, Ohio State, Texas A&M, Toronto, UIUC, UTEP, UNM, U of Chicago, U of Oregon, U of Wisconsin
• Industry: Bell Labs, Cray, HP, IBM, Intel, CFS (Lustre), Quadrics, SGI
Apple Pie
• Open source
• Partnerships: labs, universities, and industry
• Scope: basic research, applied research, development, prototypes, testbed systems, and deployment
• Structure: "don't choose a winner too early"
– Current or near-term problems: commonly used, open-source OSes (e.g., Linux or FreeBSD)
– Prototyping work in K42 and Plan 9
– At least one wacko project (explore novel ideas that don't fit into an existing framework)
A bit more interesting
• Virtualization
– Colony
• Adaptability
– DAiSES, K42, MOLAR, SmartApps
– Config, RWK
• Usage model & system mgmt (OS Suites)
– Colony, Config, MOLAR, Peta-scale SSI, Zepto
• Metrics & Measurement
– HPC Challenge (http://icl.cs.utk.edu/hpcc/)
– DAiSES, K42, MOLAR, RWK, Zepto
• Fault handling
– Colony, MOLAR, Scalable FT, Zepto
continued
• Managing the memory hierarchy
• Security
• Common API
– K42, Linux
• Single System Image– Peta-scale SSI
• Collective Runtime– Zepto
• I/O– Peta-scale SSI
• OS Noise– Colony, Peta-scale SSI, RWK, Zepto
Application Driven
• Meet the application developers
– OS presentations
– Apps people panic: what are you doing to my machine?
– OS people tell 'em what we heard
– Apps people tell us what we didn't understand