Adapt or Die: The Challenge for MPI in a Post-Exascale World

Martin Schulz Technische Universität München Fakultät für Informatik SOS Workshop Asheville, NC, USA March 2019 Adapt or Die: The Challenge for MPI in a Post-Exascale World


Page 1: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Martin Schulz

Technische Universität München

Fakultät für Informatik

SOS Workshop

Asheville, NC, USA

March 2019

Adapt or Die: The Challenge for MPI in a Post-Exascale World

Page 2: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Rising complexity of architectures

• On node: accelerators, deep memory hierarchies, …

• Off node: new I/O systems, high-dim. networks, …

Rising complexity of applications

• New algorithms

• Ensemble computation for UQ, scale-bridging, …

Holistic HW/SW Co-Design to map applications to architectures

• Substantial work in the HW and Application layers

• Includes design of the middleware layer

Challenges will get even harder

• Severe resource limitations in power/energy, network, I/O, …

• Increased variability even on homogeneous systems

• Complex workflows with varying demands

• New workloads with new requirements and pain points

Fighting the Uphill Battle to Exascale

Page 3: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Adaptivity is needed for efficient resource utilization

• Worst case provisioning is a waste of resources

• But limited resources lead to contention, variability, …

• Need to actively manage resources (e.g., for power, I/O, …)

Adaptivity is needed to counteract variability

• Increasing variability will be the new normal

• Need to actively balance and shift workloads

Adaptivity is needed to manage complex workflows

• Single, static applications in pure SPMD style will be a thing of the past

• Coupling of components for UQ and/or scale bridging

• Need to actively schedule components with varying demands

Adaptivity is needed by new workloads, especially in ML/DL/AI

This adaptivity must be managed and exploited across the whole software stack:

hardware, middleware and application – and that includes MPI

Adaptivity will be Key

Page 4: Adapt or Die: The Challenge for MPI in a Post-Exascale World

MPI’s Philosophy

• MPI is running the show – once started very little external control

• MPI is controlling progress – little interaction with other runtimes

• MPI has fixed resources – MPI_COMM_WORLD

Some of the tried approaches

• Dynamic process management (since MPI 2)

• Helper threads (for BG/L)

• Many research projects on external progress, FT, …

Clearly not sufficient, even for HPC

• Moldability present but nowhere adopted

• Malleability not available at all; HPC applications use checkpoint/restart (C/R) instead

• Use of secondary communication systems is growing

PLUS: We are not addressing needs of new communities!

Question: (How) Can We Make MPI More Adaptive?

Where Is MPI When it Comes to Adaptivity?

Page 5: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Need to maintain the key flavor of MPI

• Dynamicity should be limited and controlled / coarse grained

• Communicators are static and won’t change all the time (or ever?)

• Keep the well known communication constructs

• “Pockets/Phases” of program need to behave as before

• Inner kernels don’t change

• Low learning curve and easy change-over for MPI-versed users

Need to work with external resource managers and runtimes

• Two-Way Communication of the MPI runtime with the outside world

• Ability to express needs and requirements

• Ability to react to external events and changing conditions

• Has to be able to support resource sharing

• Should still be agnostic to other resource usages

Application logic needs to stay in the application

• No automatic data re-distribution

• No rewriting of application state

Things to Consider Towards a Malleable MPI

Page 6: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Research project on “Invasive Computing”

• State-of-the-art scientific applications utilize algorithms with evolving properties

• AMR, changing meshes, …

• The current assignment of fixed resources to these applications is suboptimal

➢ “Invading instead of wasting HPC resources”

Approach:

• Resource manager initiates shrink or grow

• Application checks for changes during adaptation window

• Typically done at iteration boundaries

• Enables controlled change of MPI_COMM_WORLD

Example: iMPI, a.k.a. Elastic MPI

Gerndt, Compres, et al.

Page 7: Adapt or Die: The Challenge for MPI in a Post-Exascale World

MPI_Init_adapt(…)

• Initializes the library in adaptive mode

MPI_Probe_adapt(…)

• Probes the resource manager for adaptations

MPI_Comm_adapt_begin(…)

• Marks the beginning of an adaptation window

• Provides a set of helper communicators

MPI_Comm_adapt_commit(…)

• Marks the end of an adaptation window

• Sets adapted MPI_COMM_WORLD

Proposed Resource Negotiation API

Gerndt, Compres, et al.

Page 8: Adapt or Die: The Challenge for MPI in a Post-Exascale World

    int main(int argc, char **argv) {
        // start MPI in adaptive mode; local_status tells a process
        // whether it is joining an already running application
        MPI_Init_adapt(&argc, &argv, &local_status);
        for (...) {
            // ask the resource manager at the iteration boundary
            MPI_Probe_adapt(&adapt, ...);
            if (local_status == MPI_ADAPT_STATUS_JOINING
                || adapt == MPI_ADAPT_TRUE) {
                MPI_Comm_adapt_begin(...);
                // adaptation window's body with
                // data redistribution code
                MPI_Comm_adapt_commit(...);
            }
            // compute and MPI code
        }
        return 0;
    }
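The body of the adaptation window is left to the application. As one illustration of what the "data redistribution code" might have to compute, the sketch below assumes a 1-D array of n elements split into contiguous blocks. `range_t`, `block_range` and `overlap` are hypothetical helpers introduced here and are not part of iMPI. When the process count changes from old_p to new_p, each new rank can work out how many elements it has to fetch from each old rank.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [lo, hi) owned by one rank. */
typedef struct { size_t lo, hi; } range_t;

/* Contiguous block decomposition of n elements over p ranks:
 * the first (n % p) ranks get one extra element.
 * Hypothetical helper, not an iMPI function. */
range_t block_range(size_t n, int p, int rank)
{
    assert(p > 0 && rank >= 0 && rank < p);
    size_t base = n / (size_t)p;
    size_t rem  = n % (size_t)p;
    size_t r    = (size_t)rank;
    range_t out;
    out.lo = r * base + (r < rem ? r : rem);
    out.hi = out.lo + base + (r < rem ? 1 : 0);
    return out;
}

/* Number of elements that new_rank (in a world of new_p ranks) must
 * receive from old_rank (in a world of old_p ranks): the size of the
 * intersection of the two blocks. */
size_t overlap(size_t n, int old_p, int old_rank, int new_p, int new_rank)
{
    range_t a = block_range(n, old_p, old_rank);
    range_t b = block_range(n, new_p, new_rank);
    size_t lo = a.lo > b.lo ? a.lo : b.lo;
    size_t hi = a.hi < b.hi ? a.hi : b.hi;
    return hi > lo ? hi - lo : 0;
}
```

For 10 elements growing from 2 to 4 ranks, new rank 1 owns [3, 6) and would receive 2 elements from old rank 0 and 1 element from old rank 1.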

Code Example

Gerndt, Compres, et al.

Page 9: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Example: Earthquake Simulation

Gerndt, Compres, et al.

Page 10: Adapt or Die: The Challenge for MPI in a Post-Exascale World

How to integrate this into the MPI standard?

• What kind of adaptivity to support?

• Cooperative or not?

• How fine grained should/coarse grained can the adaptation be?

• How to interface with runtimes and resource managers?

• New SRI initiative based on ideas from PMI and PMIx

• How to guide/decide on adaptation?

How Could We Extend the Concept of iMPI

Page 11: Adapt or Die: The Challenge for MPI in a Post-Exascale World

• Capturing best configurations

• Predicting runtime and power usage

• Estimating network load

Prerequisite: continuous monitoring

• Across all components and systems

• Across the entire software stack

• For all users and applications

Need To Understand Workloads

[Figure: Stack-wide Data Collection & Semantic Correlation. Labels: Application; Hardware; CPU, NUMA, Netw.; OS/Comm.; MPI, Thrds., Tasks; Prg. Model; Msg., PGAS, DSL; Libraries]

Page 12: Adapt or Die: The Challenge for MPI in a Post-Exascale World

LRZ’s Data Center DataBase (DCDB)

[Figure: DCDB architecture. Labels: Operations Monitoring; REST API; libdcdb; Database Interface; dcdbpusher with plugins (IPMI, perf events, XML (Clustsafe), SNMP, BACnet, sysfs); User/Admin Interface; REST API; Sensor Cache; Control; Collect Agent; MQTT Server; Database Interface; Sensor Data Cache; Data Analysis. Legend: Planned, Ongoing, In Use]

Source: Michael Ott, Daniele Tafani, LRZ

Page 13: Adapt or Die: The Challenge for MPI in a Post-Exascale World

• Capturing best configurations

• Predicting runtime and power usage

• Estimating network load

Prerequisite: continuous monitoring

• Across all components and systems

• Across the entire software stack

• For all users and applications

Must include application information

• Key: progress information

• Interfaces like LLNL’s Caliper can help

• Integration of low-level resources like hardware counters

From the MPI perspective:

• Must include MPI internal configuration information → MPI_T CVARs

• Must include MPI internal performance data → MPI_T PVARs

• Must include adaptivity trigger points within MPI → MPI_T Events

• Must include analysis hooks from several layers

Need To Understand Workloads

[Figure: Stack-wide Data Collection & Semantic Correlation. Labels: Application; Hardware; CPU, NUMA, Netw.; OS/Comm.; MPI, Thrds., Tasks; Prg. Model; Msg., PGAS, DSL; Libraries]

Page 14: Adapt or Die: The Challenge for MPI in a Post-Exascale World

MPI Profiling Interface offers convenient hooks

• One tool, user controlled

• Ideal for performance tools

Hooks for Ubiquitous Analysis

[Figure labels: Application; MPI Library; Profiling Tool]

Page 15: Adapt or Die: The Challenge for MPI in a Post-Exascale World

MPI Profiling Interface offers convenient hooks

• One tool, user controlled

• Ideal for performance tools

Mechanisms to support multiple tools

• Tool stacks (e.g., PnMPI)

• Still under user control

Needed: tool stacks with multiple players

• System configurations

• Runtime enhancements

• Tools

Hooks for Ubiquitous Analysis

[Figure labels: Application; MPI Library; Tracing Tool; Profiling Tool]

Page 16: Adapt or Die: The Challenge for MPI in a Post-Exascale World

MPI Profiling Interface offers convenient hooks

• One tool, user controlled

• Ideal for performance tools

Mechanisms to support multiple tools

• Tool stacks

• Still under user control

Needed: tool stacks with multiple players

• System configurations

• Runtime enhancements

• Tools

• Under one roof, with one interface

• Independent from each other

• Function pointer based

Hooks for Ubiquitous Analysis

[Figure labels, tool stack: Application; MPI Library; Tracing Tool; Profiling Tool. Figure labels, multiple players: Application; MPI Library; End-User Tool; System Monitor; Power Runtime]

Page 17: Adapt or Die: The Challenge for MPI in a Post-Exascale World

MPI Profiling Interface offers convenient hooks

• One tool, user controlled

• Ideal for performance tools

Mechanisms to support multiple tools

• Tool stacks

• Still under user control

Needed: tool stacks with multiple players

• System configurations

• Runtime enhancements

• Tools

• Under one roof, with one interface

• Independent from each other

• Function pointer based

➢ Code name: QMPI

Prototype coming to a system near you soon!
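The "function pointer based" idea can be sketched without MPI at all. The code below is a toy model and not the QMPI interface: `send_fn`, `chain_t`, `chain_register`, `chain_send` and the two example tools are hypothetical names. Each player registers one function pointer for an intercepted call and forwards to the next level; the bottom of the stack stands in for the MPI library.

```c
#include <assert.h>

#define MAX_TOOLS 8

struct chain;

/* Signature of an intercepted "send": a tool receives the chain, its
 * own level in the stack, and the message size. */
typedef int (*send_fn)(struct chain *c, int level, int bytes);

typedef struct chain {
    send_fn tools[MAX_TOOLS]; /* level 0 is closest to the application */
    int     ntools;
    long    bytes_seen;       /* state kept by the monitoring tool */
    int     calls_traced;     /* state kept by the tracing tool */
    int     library_calls;    /* how often the stand-in library ran */
} chain_t;

/* Call the level below `level`, or the stand-in library at the bottom. */
int chain_next(chain_t *c, int level, int bytes)
{
    int next = level + 1;
    assert(next >= 0);
    if (next < c->ntools)
        return c->tools[next](c, next, bytes);
    c->library_calls++;
    return 0;
}

/* Example player 1: a system monitor that accumulates traffic volume. */
int monitor_send(chain_t *c, int level, int bytes)
{
    c->bytes_seen += bytes;
    return chain_next(c, level, bytes);
}

/* Example player 2: an end-user tracing tool that counts calls. */
int trace_send(chain_t *c, int level, int bytes)
{
    c->calls_traced++;
    return chain_next(c, level, bytes);
}

/* Append a tool to the stack; returns 0 on success, -1 if full. */
int chain_register(chain_t *c, send_fn f)
{
    if (c->ntools >= MAX_TOOLS)
        return -1;
    c->tools[c->ntools++] = f;
    return 0;
}

/* Application entry point: start above level 0. */
int chain_send(chain_t *c, int bytes)
{
    return chain_next(c, -1, bytes);
}
```

In this model the tools are independent of each other: each one only knows its own level and the forwarding call, so a system monitor and an end-user tool can be stacked without being aware of one another.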

Hooks for Ubiquitous Analysis

[Figure labels, tool stack: Application; MPI Library; Tracing Tool; Profiling Tool. Figure labels, multiple players: Application; MPI Library; End-User Tool; System Monitor; Power Runtime]

Page 18: Adapt or Die: The Challenge for MPI in a Post-Exascale World

How to integrate this into the MPI standard?

• What kind of adaptivity to support?

• Cooperative or not?

• How fine grained should the adaptation be?

• How to interface with runtimes and resource managers?

• New SRI initiative based on ideas from PMI and PMIx

• How to guide adaptation?

Adaptation of communicators is problematic

• iMPI only supports COMM_WORLD

• Need subsetting

• Changes assumptions of application significantly

• Adaptation of libraries not simple

If we only had a way to provide

• Initialize multiple and possibly changing COMM_WORLDS

• Provide isolation between communicators

• Query process sets from the runtime

Coming back to: How Could We Extend the Concept of iMPI

Page 19: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Started as simple “local” initialization

• Grown to an “amorphous being”

• Could be a concept to support adaptivity

• Could be a concept to support fault tolerance

• Could be a concept to support isolation

Basic scheme

1. Get local access to the MPI library

Get a Session Handle

2. Query the underlying run-time system

Get a “set” of processes

3. Determine the processes you want

Create an MPI_Group

4. Create a communicator with just those processes

Create an MPI_Comm

MPI Sessions Is the Answer! Is MPI Sessions the Answer?

[Figure: MPI Sessions: MPI_Session → Set of processes → MPI_Group → MPI_Comm]

Page 20: Adapt or Die: The Challenge for MPI in a Post-Exascale World

MPI Session’s intended goals

• No more implicit MPI_COMM_WORLD

• Enable runtime information to flow into MPI

• Creation of communicators without parent communicators

Within a single MPI Session

• Query process sets

• Derive (static) groups

• Derive (static) communicators

• Static MPI bubbles/universes/enclaves …

Is MPI Sessions the Answer? Perhaps

[Figure labels: MPI_Session; Set of processes; MPI_Group; MPI_Comm; Set of processes]

Page 21: Adapt or Die: The Challenge for MPI in a Post-Exascale World

What if …

Options

Page 22: Adapt or Die: The Challenge for MPI in a Post-Exascale World

What if: process sets could change over time?

Within a session

• Query new set sizes

• Create new communicators

• User code for proper switch-over

Issue 1: How to ensure everyone sees the same process set?

• Set versioning

• Communicator creation fails if derived from groups derived from sets with different versions

• Iteration until successful creation of a new communicator
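The versioning idea can be illustrated with a small model that does not use MPI. Everything here is an assumption made for illustration: `pset_t`, `group_t`, `group_from_pset`, `comm_create` and `create_with_retry` are hypothetical names, and a real design would need a distributed agreement step rather than a shared array of views.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a process set as published by the runtime. */
typedef struct { int version; int size; } pset_t;

/* A group remembers the version of the set it was derived from. */
typedef struct { int version; int size; } group_t;

group_t group_from_pset(const pset_t *s)
{
    group_t g = { s->version, s->size };
    return g;
}

/* Creation succeeds only if every participant derived its group
 * from the same version of the set. */
bool comm_create(const group_t *groups, int n)
{
    assert(n > 0);
    for (int i = 1; i < n; i++)
        if (groups[i].version != groups[0].version)
            return false;
    return true;
}

/* Iterate until creation succeeds. views[i] is the set that
 * participant i currently sees; latest is the newest set known to
 * the runtime. After a failed attempt every participant re-queries.
 * Returns the number of attempts needed, or -1 if none succeeded. */
int create_with_retry(pset_t *views, int n, const pset_t *latest,
                      int max_attempts)
{
    group_t groups[16];
    assert(n > 0 && n <= 16);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        for (int i = 0; i < n; i++)
            groups[i] = group_from_pset(&views[i]);
        if (comm_create(groups, n))
            return attempt;
        for (int i = 0; i < n; i++)
            views[i] = *latest;   /* re-query the runtime */
    }
    return -1;
}
```

If one of three participants has already seen version 2 while the others still hold version 1, the first attempt fails, all three re-query, and the second attempt succeeds.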

Issue 2: Integration of new MPI processes

• Need trigger mechanism to notify existing processes

• Enlarged communicator created with new processes

• Distribute data

• Free old communicator

Option 1: Dynamic Set Management

Page 23: Adapt or Die: The Challenge for MPI in a Post-Exascale World

What if: a runtime/RM could influence/terminate a session?

On changing resources a runtime invalidates a session bubble

• Based on the used process set(s)

• Once bubble is invalidated

• Either disallow communication (return error)

• Or issue warning

Application can/has to react to runtime input

• Create new session bubble with new processes

• Redistribute data

• Cleanup and delete old session bubble
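The reaction pattern above can be modeled in a few lines. This is a sketch of the "disallow communication (return error)" variant only, with hypothetical names throughout (`bubble_t`, `runtime_invalidate`, `bubble_send`, `send_with_recovery`); it does not correspond to any existing MPI call.

```c
#include <assert.h>

enum { BUBBLE_OK = 0, BUBBLE_ERR_REVOKED = 1 };

/* Toy model of a session bubble: a process count, a validity flag
 * controlled by the runtime, and a counter of completed sends. */
typedef struct { int nprocs; int valid; int sent; } bubble_t;

bubble_t bubble_create(int nprocs)
{
    assert(nprocs > 0);
    bubble_t b = { nprocs, 1, 0 };
    return b;
}

/* Runtime side: resources changed, so the bubble is invalidated. */
void runtime_invalidate(bubble_t *b)
{
    b->valid = 0;
}

/* Communication on an invalidated bubble returns an error. */
int bubble_send(bubble_t *b)
{
    if (!b->valid)
        return BUBBLE_ERR_REVOKED;
    b->sent++;
    return BUBBLE_OK;
}

/* Application side: on error, create a new bubble with the new
 * processes, redistribute data, drop the old bubble, and retry. */
int send_with_recovery(bubble_t *b, int new_nprocs)
{
    int rc = bubble_send(b);
    if (rc == BUBBLE_ERR_REVOKED) {
        bubble_t fresh = bubble_create(new_nprocs);
        /* application-specific data redistribution would go here */
        *b = fresh;               /* old bubble is cleaned up */
        rc = bubble_send(b);
    }
    return rc;
}
```

The application logic stays in the application: the model only signals the error, and the recovery path decides how to rebuild and where to redistribute.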

Could also be a clean recipe for fault tolerance

• Resource isolation

• Clean composability

Option 2: Runtime Impact on Bubbles

Page 24: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Ability to reason about all MPI objects that are

• … derived from the same local session(s)

• … part of the same process groups

• … create their own isolated resources

These objects form a natural group

• Isolated from the world in terms of communication

• Could be revoked without global impact

• Granularity of malleability

• Keeping flavor of MPI

• Maintaining coarse granularity

• Clean concept for FT

Open issue: MPI objects that are not “sessioned”

• Datatypes

• Info objects

• MPI Tools Information Interface

Let’s Make Pigs Fly: Towards a Global Session (Bubbles) Concept

Page 25: Adapt or Die: The Challenge for MPI in a Post-Exascale World

Systems Require More Adaptivity

• Key to supporting future resource-constrained systems

• But: the current MPI is not flexible enough

We need the ability to dynamically adapt

• Step 1: Introspection

MPI_T & QMPI are important parts of this

• Step 2: Support runtime/RM interactions

Two-way negotiation abilities → SRI efforts

• Step 3: Enable changing resources

Growing and shrinking process sets

MPI Sessions is a good step, but …

• Need ability to reason about global concepts

• Need ability to query external influence

• Need ability to capture and query changes

Needs to be Use Case Driven - Join the Discussion!

MPI Needs to Get More Adaptive