


FFMK: A FAST AND FAULT-TOLERANT MICROKERNEL-BASED SYSTEM FOR EXASCALE COMPUTING

Scientific Network (Selection)

Center for Advancing Electronics Dresden TU Dresden Excellence Cluster

Frank Bellosa Karlsruhe Institute of Technology

Laxmikant V. Kale University of Illinois at Urbana-Champaign, Charm++

Yutaka Ishikawa University of Tokyo RIKEN

Argo / Hobbes / mOS Argonne / Sandia / Intel

SPPEXA: ESSEX / GROMEX Gerhard Wellein / Ivo Kabadshow

Highly Adaptive Energy-Efficient Computing, SFB 912

ASTEROID SPP1500

Gernot Heiser UNSW, NICTA

Vijay Saraswat IBM Research Zürich, X10

Torsten Hoefler ETH Zurich

Michael Bussmann Helmholtz-Zentrum Dresden-Rossendorf

Eric Van Hensbergen IBM Research Austin DARPA HPCS, FastOS, X-Stack

Frank Mueller North Carolina State University

Phase 1 Results: Summary

■ First L4-based prototype

■ Several source-compatible MPI applications ported

■ Tested on small island of real HPC cluster

■ Gossip scalability and resilience modeled, simulated, and measured

■ Erasure-coded in-memory checkpoints with XtreemFS, tested on Cray XC40 (principle illustrated after this list)

■ 2 SPPEXA Workshops
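
To illustrate the idea behind the erasure-coded checkpoints mentioned above: checkpoint data is split into fragments plus redundancy so that losing a node does not lose the checkpoint. The sketch below shows only the simplest possible code, a single XOR parity fragment over K data fragments that can repair any one lost fragment; production systems use stronger codes such as Reed-Solomon, and all names here are invented for the example.

```c
/* Minimal illustration of erasure-coded checkpoint fragments:
 * K data fragments plus one XOR parity fragment tolerate the loss of
 * any single fragment. Real systems use stronger (Reed-Solomon-style)
 * codes that tolerate multiple losses.                                 */
#include <stdio.h>
#include <string.h>

#define K          4      /* number of data fragments            */
#define FRAG_SIZE  8      /* bytes per fragment (tiny for demo)  */

/* XOR all fragments except 'skip' into 'out'; used both to build the
 * parity fragment and to reconstruct one missing fragment.            */
static void xor_fragments(unsigned char frags[][FRAG_SIZE], int skip,
                          unsigned char *out)
{
    memset(out, 0, FRAG_SIZE);
    for (int i = 0; i <= K; i++) {           /* K data + 1 parity slot */
        if (i == skip) continue;
        for (int b = 0; b < FRAG_SIZE; b++)
            out[b] ^= frags[i][b];
    }
}

int main(void)
{
    unsigned char frags[K + 1][FRAG_SIZE];   /* slot [K] holds the parity */

    /* Fill data fragments with some checkpoint bytes. */
    for (int i = 0; i < K; i++)
        for (int b = 0; b < FRAG_SIZE; b++)
            frags[i][b] = (unsigned char)(i * FRAG_SIZE + b);

    xor_fragments(frags, K, frags[K]);       /* compute the parity fragment */

    /* Pretend fragment 2 was lost with its node; rebuild it from the rest. */
    unsigned char rebuilt[FRAG_SIZE];
    xor_fragments(frags, 2, rebuilt);

    printf("fragment 2 %s\n",
           memcmp(rebuilt, frags[2], FRAG_SIZE) == 0 ? "recovered" : "lost");
    return 0;
}
```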

Prof. Alexander Reinefeld Zuse Institute Berlin

Thomas Steinke Thorsten Schuett Florian Wende

Distributed File Systems

■ Flease - Lease Coordination Without a Lock Server, International Parallel and Distributed Processing Symposium, 2011

■ Consistency and Fault Tolerance for Erasure-Coded Distributed Storage Systems, Workshop on Data Intensive Distributed Computing at HPDC 2012

Prof. Amnon Barak Hebrew University of Jerusalem

Amnon Shiloh Ely Levy Tal Ben-Nun Alexander Margolin Michael Sutton

Load Balancing

■ Resilient Gossip Algorithms for Collecting Online Management Information in Exascale Clusters, Concurrency and Computation: Practice and Experience, 2015

■ An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster, IEEE Transactions on Parallel and Distributed Systems, Vol. 11, 2000

FFMK System Architecture

■ L4 microkernel on every node

■ Programming paradigms provided as library-based runtimes

■ Performance-critical parts of MPI, InfiniBand, and checkpointing run directly on L4

■ Non-critical support functionality reuses Linux (e.g., XtreemFS MRC+OSD, MPI startup+control)

■ Gossip algorithms disseminate info for platform management (see the sketch after this list)

■ Linux compatibility via virtualization

■ Optional application hints can improve decision making

■ GROMEX, COSMO-SPECS+FD4, CP2K, benchmarks, mini apps, …
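
To give a rough idea of the gossip mechanism referenced above (a toy single-process simulation, not the FFMK implementation): each node keeps a table of per-node load records and periodically pushes its freshest entries to a randomly chosen peer, which keeps whichever record is newer. All names and parameters below are invented for the illustration.

```c
/* Toy simulation of push gossip for load dissemination: each node
 * periodically sends its view (newest load record per node) to one
 * random peer, which keeps the fresher of the two records.            */
#include <stdio.h>
#include <stdlib.h>

#define NODES  16
#define ROUNDS 8

struct record { int age; double load; };     /* age 0 = unknown */

static struct record view[NODES][NODES];     /* view[i][j]: what i knows about j */

/* Merge node 'src's view into node 'dst', keeping newer records. */
static void merge(int dst, int src)
{
    for (int j = 0; j < NODES; j++)
        if (view[src][j].age > view[dst][j].age)
            view[dst][j] = view[src][j];
}

int main(void)
{
    srand(42);

    /* Initially every node only knows its own load. */
    for (int i = 0; i < NODES; i++) {
        view[i][i].age  = 1;
        view[i][i].load = (double)(rand() % 100);
    }

    for (int r = 1; r <= ROUNDS; r++) {
        for (int i = 0; i < NODES; i++) {
            int peer = rand() % NODES;       /* pick a random gossip partner */
            merge(peer, i);                  /* push own view to the peer    */
        }
        /* Count how many (i,j) pairs already carry a load record. */
        int known = 0;
        for (int i = 0; i < NODES; i++)
            for (int j = 0; j < NODES; j++)
                if (view[i][j].age > 0) known++;
        printf("round %d: %3d of %d records known\n", r, known, NODES * NODES);
    }
    return 0;
}
```

The interval between such rounds is the parameter varied between 2 ms and 1024 ms in the gossip overhead measurements on this poster: shorter intervals give fresher platform information at the cost of more network traffic.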

Second L4-based Prototype: Decoupled Execution

■ Avoids operating system noise by sidestepping Linux

■ HPC applications are ordinary Linux processes, but their threads are moved to compute cores controlled by L4

■ Communication over InfiniBand through direct hardware access

■ Linux system calls: move the thread back into Linux, handle the operation on a service core, then return to the compute core (see the sketch after this list)

■ L4 system calls: faster scheduling, threads, memory, …
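
The Linux system-call path above can be mimicked on stock Linux to get a feel for the mechanism. The sketch below is only an analogy built on sched_setaffinity, not the L4-based thread migration of the prototype: the worker re-pins itself to a service core for the duration of a Linux call and then returns to its compute core (the core IDs are assumptions for the example).

```c
/* Analogy for decoupled execution on plain Linux: keep the worker on a
 * "compute" core and hop to a "service" core only for Linux syscalls.
 * FFMK does this transparently via L4; here the thread re-pins itself. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define COMPUTE_CORE 1   /* assumed core IDs for the example */
#define SERVICE_CORE 0

static void pin_to(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}

/* Run a "Linux operation" on the service core, then come back. */
static void on_service_core(void (*op)(void))
{
    pin_to(SERVICE_CORE);   /* leave the compute core...            */
    op();                   /* ...handle the operation under Linux  */
    pin_to(COMPUTE_CORE);   /* ...and return to undisturbed compute */
}

static void write_log(void)
{
    const char msg[] = "checkpoint written\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void)
{
    pin_to(COMPUTE_CORE);

    for (int step = 0; step < 3; step++) {
        /* Compute phase: pure user-level work, no Linux involvement. */
        volatile double x = 0;
        for (long i = 0; i < 10 * 1000 * 1000; i++)
            x += i * 0.5;

        on_service_core(write_log);   /* rare Linux syscall */
    }
    return 0;
}
```

In the actual prototype this hand-over is done by the kernels rather than by the application, and L4 additionally provides its own, faster system calls for scheduling, threads, and memory.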

Prof. Wolfgang E. Nagel Technische Universität Dresden, ZIH

Matthias Lieber

MPI and Performance Analysis

■ The International Exascale Software Roadmap, International Journal of High Performance Computing Applications 25(1), 2011

■ VAMPIR: Visualization and Analysis of MPI Resources, Supercomputer 63, XII(1):69–80, 1996

Prof. Hermann Härtig Technische Universität Dresden

Carsten Weinhold Adam Lackorzynski Jan Bierbaum Martin Küttler Maksym Planeta Hannes Weisbach

Operating Systems

■ The Performance of µ-Kernel-Based Systems, SOSP 1997

■ VPFS: Building a Virtual Private File System with a Small Trusted Computing Base, EuroSys 2008

■ ATLAS: Look-Ahead Scheduling Using Workload Metrics, RTAS 2013

[Diagram: L4 microkernel hosting Linux with non-critical applications, plus critical applications running directly on L4.]

[Diagram: XtreemFS client accessing the MRC (metadata) and OSDs (file content).]

[Architecture diagram: application and MPI library with proxies on compute cores atop the light-weight kernel (L4 microkernel); Linux kernel and runtime on service cores; InfiniBand and monitors on both sides; platform management combining platform info (gossip), hardware monitoring, application hints, resource prediction, decision making, checkpointing, and migration.]

[Figure 3: PTRANS performance in GB/s (higher is better) for gossip intervals from 2 ms to 1024 ms and for runs without gossip, on 1024, 2048, 4096, and 8192 nodes.]

[Figure 4: MPI-FFT runtime in seconds (lower is better) for gossip intervals from 2 ms to 1024 ms and for runs without gossip, on 1024, 2048, 4096, and 8192 nodes; the inner red part indicates the MPI portion.]

[Figure 5: COSMO-SPECS+FD4 runtime in seconds (lower is better) for gossip intervals from 1 ms to 1024 ms and for runs without gossip, on 1024, 2048, 4096, and 8192 nodes; the inner red part indicates the MPI portion.]

Gossip: Scalability/Overhead

MPI-FFT (Blue Gene/Q, 1024 nodes); gossip intervals from 2 ms to 1024 ms compared to runs without gossip.

Dynamic Platform Management

■ Consider CPU cycles, memory bandwidth, and other resources

■ Classification based on memory load ("memory dwarfs") to optimize scheduling and placement

■ Prediction of resource usage using hardware counters and application-level hints (e.g., number of particles, time steps)
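
To make the prediction idea above concrete, here is a minimal sketch (invented model, data, and names, not FFMK code): cycles per time step are fitted as a linear function of an application hint, here the local particle count, from a short history of hardware-counter readings, and then extrapolated for the next step.

```c
/* Illustrative resource prediction: fit cycles-per-timestep as a linear
 * function of an application hint (local particle count) from a short
 * history of hardware-counter samples, then extrapolate.
 * All numbers and names are made up for the example.                   */
#include <stdio.h>

struct sample { double particles; double cycles; };

/* Least-squares fit of cycles = a + b * particles. */
static void fit(const struct sample *s, int n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += s[i].particles;
        sy  += s[i].cycles;
        sxx += s[i].particles * s[i].particles;
        sxy += s[i].particles * s[i].cycles;
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

int main(void)
{
    /* History: (particle-count hint, measured cycles) per time step,
     * as it might be collected from hardware performance counters.    */
    struct sample history[] = {
        { 1000, 2.1e9 }, { 1200, 2.5e9 }, { 1500, 3.0e9 }, { 1700, 3.5e9 },
    };
    double a, b;
    fit(history, 4, &a, &b);

    double hint = 2000;                    /* particles expected next step */
    double predicted = a + b * hint;
    printf("predicted cost for %.0f particles: %.2e cycles\n", hint, predicted);
    return 0;
}
```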

Fault Tolerance

■ Application interfaces to optimize or avoid C/R (e.g., hints on when to checkpoint, ability to recover from node loss); see the sketch after this list

■ Node-level fault tolerance: Multiple Linux instances, micro-rebooting, proactive migration away from failing nodes
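
One possible shape for such an application interface is sketched below; all ffmk_* names and their semantics are hypothetical, invented for the illustration and not the project's actual API. The application tells the runtime when its state is small and consistent, and registers a callback through which it can rebuild lost data itself instead of triggering a rollback.

```c
/* Hypothetical checkpoint/recovery hint interface (NOT the real FFMK API);
 * the ffmk_* stubs below only emulate what a runtime might act on.      */
#include <stdio.h>
#include <stddef.h>

typedef int (*ffmk_recover_fn)(int lost_rank, void *user);

static ffmk_recover_fn recover_cb;   /* stub runtime state */
static void           *recover_arg;

/* Application says: my state is small and consistent right now. */
static void ffmk_hint_good_checkpoint_time(size_t dirty_bytes)
{
    printf("runtime: schedule erasure-coded checkpoint (~%zu dirty bytes)\n",
           dirty_bytes);
}

/* Application says: on node loss, call me back instead of rolling back. */
static void ffmk_register_recovery(ffmk_recover_fn fn, void *user)
{
    recover_cb  = fn;
    recover_arg = user;
}

/* Stub for the runtime noticing a failed node. */
static void ffmk_simulate_node_loss(int rank)
{
    if (recover_cb && recover_cb(rank, recover_arg) == 0)
        printf("runtime: application recovered, no rollback needed\n");
}

/* --- application side ---------------------------------------------- */
static int rebuild_domain(int lost_rank, void *user)
{
    (void)user;
    printf("app: recomputing domain data previously owned by rank %d\n",
           lost_rank);
    return 0;                               /* 0 = recovered in place */
}

int main(void)
{
    ffmk_register_recovery(rebuild_domain, NULL);

    for (int step = 0; step < 3; step++) {
        /* ... time-step computation ... */
        if (step % 2 == 0)                  /* state is compact here */
            ffmk_hint_good_checkpoint_time(64 * 1024);
    }

    ffmk_simulate_node_loss(3);             /* pretend rank 3's node died */
    return 0;
}
```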

Related Projects

Load Imbalances

COSMO-SPECS+FD4 (128 ranks, 180 timesteps)

[Heat map: computation time (fraction, 0.1 to 1.0) per process ID over timesteps 0 to 180.]

[Chart: run time in seconds (18.1 to 18.7) over number of cores (30 to 750), comparing standard and decoupled execution on L4Linux / L4 microkernel.]

Phase 1 (2013 – 2015)

Phase 2 (2016 – 2017)