HARE 2010 Review




Presentation given at the USENIX FastOS workshop reviewing HARE work done in 2010.


Page 1: HARE 2010 Review

Holistic Aggregate Resource Environment Execution Model

FastOS USENIX 2010 Workshop
Eric Van Hensbergen ([email protected])
http://hare.fastos2.org

Page 2: HARE 2010 Review

Research Objectives

• Look at ways of scaling general purpose operating systems and runtimes to leadership class supercomputers (thousands to millions of cores)

• Alternative approaches to systems software support, runtime and communications subsystems

• Exploration built on top of Plan 9 distributed operating system due to portability, built-in facilities for distributed systems and flexible communication model

• Plan 9 support for BG/P and HARE runtime open-sourced and available via: http://wiki.bg.anl-external.org

• Public profile available on ANL Surveyor BG/P machine, should be usable by anyone


Page 3: HARE 2010 Review

Roadmap

[Roadmap chart: Years 0-3, with tracks for Hardware Support, Systems Infrastructure, and Evaluation, Scaling, & Tuning]

Year 2 Accomplishments

• Improved tracing infrastructure
• Currying Framework
• Scaling infrastructure to 1000 nodes
• Execution model
• Plan 9 for Blue Gene/P open sourced
• Kittyhawk open sourced
• Default profiles for Kittyhawk and Plan 9 installed at ANL on Surveyor

Page 4: HARE 2010 Review

New Publications (since Supercomputing 2009)

• Using Currying and Process-Private System Calls to Break the One-Microsecond System Call Barrier. Ronald G. Minnich, John Floren, Jim McKie; 2009 International Workshop on Plan 9.

• Measuring Kernel Throughput on Blue Gene/P with the Plan 9 Research Operating System. Ronald G. Minnich, John Floren, Aki Nyrhinen; 2009 International Workshop on Plan 9.

• XCPU3. Pravin Shinde, Eric Van Hensbergen; EuroSys 2010.

• PUSH, a Dataflow Shell. N. Evans, E. Van Hensbergen; EuroSys 2010.


Page 5: HARE 2010 Review

Ongoing Work

• File system and Cache Studies
  • simple cachefs deployable on I/O nodes and compute nodes
  • experiments with direct attached storage using CORAID
• MPI Support (ROMPI)
• Enhanced Allocator
  • lower overhead allocator
  • working towards easier approach to multiple page sizes
  • working towards schemes capable of supporting hybrid communication models
• Scaling beyond 1000 nodes (runs on Intrepid at ANL)
• Application and Runtime Integration


Page 6: HARE 2010 Review

Execution Model


Page 7: HARE 2010 Review

Core Concept: BRASIL
Basic Resource Aggregate System Inferno Layer

• Stripped-down Inferno: no GUI or anything else we can live without; minimal footprint

• Runs as a daemon (no console); all interaction is via 9P mounts of its namespace

• Different modes (see the mount sketch below)
  • default (exports /srv/brasil or listens on tcp!127.0.0.1!5670)
  • gateway (exports over standard I/O; to be used by ssh initialization)
  • terminal (initiates an ssh connection and starts a gateway)

• Runs EVERYWHERE
  • user's workstation
  • Surveyor login nodes
  • I/O nodes
  • compute nodes
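A minimal sketch of attaching to that namespace from a Linux workstation, assuming plan9port's 9pfuse and the default port listed above; the mount point and the exact daemon invocation are placeholders:

    brasild                                   # start the daemon on the workstation (invocation assumed)
    9pfuse 'tcp!127.0.0.1!5670' /mnt/brasil   # mount its exported namespace over 9P
    ls /mnt/brasil                            # browse the exported resources (e.g. csrv)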


Page 8: HARE 2010 Review

nompirun: legacy-friendly job launch

• user initiates execution from the login node using the nompirun script, e.g. nompirun -n 64 ronsapp (see the fuller example below)

• setup/boot/exec
  • the script submits the job using Cobalt
  • when the I/O node boots it connects to the user's brasild via 9P over Ethernet
  • when the CPU nodes boot they connect to the I/O node via 9P over the collective network
  • after boilerplate initialization, $HOME/lib/profile is run on every node for additional setup, namespace initialization, and environment setup
  • the user-specified application runs with the specified arguments on all compute nodes; the application (and supporting data and configuration) can come from the user's home directory on the login nodes or from any available file server in the namespace

• standard-I/O output from all compute nodes is aggregated at the I/O nodes and sent over the miniciod channel (thanks to some sample code from the ZeptoOS team) to the service nodes for standard reporting

• nodes boot and application execution begins in under 2 minutes
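A hypothetical fuller invocation, using flags from the nompirun(8) synopsis later in this deck; the account name, wall-time value, and application are placeholders:

    nompirun -A MyProject -n 64 -t 30 ronsapp arg1 arg2   # submit a 64-node run of ronsapp via cobalt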


Page 9: HARE 2010 Review

Our Approach: Workload Optimized Distribution


[Figure: Workload Optimized Distribution. Panels: Desktop Extension; PUSH Pipeline Model (local service, proxy service, aggregate service, remote services); Aggregation via Dynamic Namespace and Distributed Service Model; Scaling & Reliability.]

Page 10: HARE 2010 Review

Core Component: Multipipes & Filters


UNIX model:  a | b | c

PUSH model:  a |< b >| c
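As a hedged illustration of the operators above (the surrounding command names are hypothetical; only the |< fan-out and >| fan-in syntax comes from the slide), a PUSH pipeline might fan a filter out across many workers and collect the results:

    cat *.log |< grep ERROR >| sort   # grep runs as multiple parallel copies; sort sees the merged stream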

Page 11: HARE 2010 Review

Preferred Embodiment: BRASIL Desktop Extension Model


[Figure: workstation, login node, I/O node, and CPU nodes chained together; the workstation-to-login-node link is an ssh-duct.]

• Setup
  • user starts brasild on the workstation
  • brasild ssh's to the login node, starts another brasil, hooks the two together with a 27b-6 duct, and mounts resources in /csrv
  • user mounts the workstation's brasild into their namespace using 9pfuse or v9fs (or can mount from a Plan 9 peer node, 9vx, p9p, or ACME-sac)

• Boot
  • user runs the anl/run script on the workstation
  • the script interacts with taskfs on the login node to start cobalt qsub
  • when the I/O nodes boot, they connect their csrv to the login node's csrv
  • when the CPU nodes boot, they connect to the csrv on the I/O node

• Task Execution (see the sketch below)
  • user runs the anl/exec script on the workstation to run the app
  • the script reserves x nodes for the app using taskfs
  • taskfs on the workstation aggregates execution by using the taskfs instances running on the I/O nodes
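A hedged sketch of the boot and task-execution steps from the workstation, assuming brasild is already running and mounted as in the earlier example; the anl/run and anl/exec names come from the slide, while the flags and arguments are placeholders:

    anl/run                     # boot: submit the cobalt job and let the csrv tree assemble itself
    anl/exec -n 64 myapp arg1   # task execution: reserve 64 nodes through taskfs and launch myapp on them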

Page 12: HARE 2010 Review

Core Concept: Central Services

• Establish a hierarchical namespace of cluster services
• Automount remote servers based on reference (e.g. cd /csrv/criswell), as in the example below
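A hedged example of the automount-on-reference behavior; the node name criswell comes from the slide, and the listed entries are assumed to mirror the per-node layout shown on the Taskfs slide:

    cd /csrv/criswell   # walking into the path mounts that node's services on demand
    ls local            # e.g. arch  brasil  clone  env  fs  net  ns  status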


[Figure: two example csrv namespace views (paths such as /csrv/local/L, /local/l1, /local/l2, /local/t, and /local/c1 through /local/c4) over a topology of a terminal t, a login node L, I/O nodes l1 and l2, and compute nodes c1-c4.]

Page 13: HARE 2010 Review

Core Concept: Taskfs

• Provide an xcpu2-like interface for starting tasks on a node (file layout and a usage sketch below)
• Hybrid model for multitask (aggregate ctl & I/O as well as granular)


/local  - exported by each csrv node
  /fs     - local (host) file system
  /net    - local network interfaces
  /brasil - local (brasil) namespace
  /arch   - architecture and platform
  /status - status (load/jobs/etc.)
  /env    - default environment for host
  /ns     - default namespace for host
  /clone  - establish a new task
  /#      - task sessions
    /0
      /ctl
      /status
      /args
      /env
      /stdin
      /stdout
      /stderr
      /stdio
      /ns
      /wait
      /#  - component session(s)
        /ctl
        ...
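A hedged sketch of driving this interface from a shell; the mount point /task and the "exec" control message are assumptions, and only the file layout above comes from the slide:

    S=$(cat /task/clone)              # opening clone creates a task session and returns its id (e.g. 0)
    echo myapp arg1 > /task/$S/args   # write the command line into args
    echo exec > /task/$S/ctl          # assumed ctl message to start execution
    cat /task/$S/stdout               # read the task's standard output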

Page 14: HARE 2010 Review


[Embedded XCPU3 slide: outline (Problem, Related Work, Solution, Evaluations, References) and an "Evaluations: deployment and aggregation time" panel.]

Page 15: HARE 2010 Review

What's Still Missing from the Execution Model?

• File system back mounts are still being developed
  • can work around this by mounting the login node or the user's workstation at a known place, no matter where you are in the system
  • once back mounts exist, we'll need a way to reach the user's desired file system from anywhere in the csrv topology ($MNTTERM)

• Taskfs scheduling model is still top-down; it needs to propagate information back up to allow efficient scheduling from the leaf nodes

• Performance
  • reworking workload distribution to go bottom-up to improve scalability and lower per-task overhead
  • Plan 9 native version of the task model to improve performance


Page 16: HARE 2010 Review

New Model Breaks Up Implementation

• mpipefs provides base I/O and control aggregation

• execfs provides a layer on top of the system procfs for additional application control and for initiating remote execution, and uses mpipefs as its interface to standard I/O

• gangfs provides group process operations and aggregation, as well as the core distributed scheduling interfaces; it builds upon execfs and uses mpipes for ctl aggregation

• statusfs will provide bottom-up aggregation of system status through the csrv hierarchy and feed metrics to the gangfs scheduler using mpipes

• the csrv component provides membership management and hierarchical links between nodes, as well as failure detection, avoidance, and recovery


Page 17: HARE 2010 Review

Future Work: Generalized Multipipe System Component

• Challenges
  • record separation for large data sets
  • determinism for HASH distributions
  • support for multiple models

• Our Approach (see the sketch below)
  • a single synthetic file per multipipe; configuration specified during pipe creation and the initial write
  • readers and writers tracked and isolated
  • "multipipe" mode uses headers for data flowing over the pipes
    • provides record separation via a size prefix
    • can be used by filters to specify a deterministic destination, or to allow for type-specific destinations
    • can also carry control messages in header blocks to control splicing
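A hedged sketch of the size-prefix framing idea using ordinary shell tools; the mount point and the line-oriented prefix format are assumptions, not the real multipipe header layout:

    msg='partial result from worker 3'
    printf '%d\n%s' "${#msg}" "$msg" > /n/mpipe/data   # write the record length, then the payload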


Page 18: HARE 2010 Review

Future Work on Execution Model

• Caches will be necessary for desktop extension model to perform well

• Linux target support (using private namespaces and back mounts within taskfs execs)

• attribute-based file system queries/views and operations
  • probably best implemented as a secondary file system layer on top of central services
• Language bindings for taskfs interactions (C, C++, python, etc.)
• Plug-in scheduling policies
• Failure and Reliability Model


Page 19: HARE 2010 Review

Questions?

• This work has been supported in part by the Department of Energy Office of Science Operating and Runtime Systems for Extreme Scale Scientific Computation project under contract #DE-FG02-

• More Info & Publications: http://hare.fastos2.org


Page 20: HARE 2010 Review

nompirun(8)

NAME
     nompirun - wrapper script for running Plan 9 on BG/P

SYNOPSIS
     nompirun [ -A cobalt_account ] [ -h brasil_host ] [ -h brasil_port ] [ -e key=value ]... [ -n num_nodes ] [ -t time ] [ -k kernel_profile ] [ -r root_path ] cmd args...

     wrapnompirun [ -A cobalt_account ] [ -h brasil_host ] [ -h brasil_port ] [ -e key=value ]... [ -n num_cpu_nodes ] [ -t time ] [ -k kernel_profile ] [ -r root_path ] cmd args...


Page 21: HARE 2010 Review

Core Concept: Ducts

• Ducts are bi-directional 9P connections
• They can be instantiated over any pipe
  • TCP/IP connection


[Figure: a duct between two nodes carried over ssh and tcp/ip; each side exports its namespace and mounts the other's.]

Page 22: HARE 2010 Review

Core Concept: 27b-6 Ducts

• Just like ducts
• Before export/mount, each side writes a size-prefixed canonical name


[Figure: as in the duct diagram, each side exports its namespace and mounts the other's over ssh and tcp/ip.]