www.bsc.es
xSim – The Extreme-Scale Simulator
Janko Strassburg
Severo Ochoa Seminar @ BSC, 28 Feb 2014
Motivation
Future exascale systems are predicted to have hundreds of thousands of nodes, with thousands of processors per node. Application scalability is limited by sequential parts, synchronizing communication and other bottlenecks.
Investigating the performance of parallel applications at scale is an important component of HPC hardware/software co-design
– Behaviour on future architectures
– Performance impact of architecture choices
Overview
Several existing simulators: JCAS, BigSim, Dimemas, MuPi
– Limitations (run time, number of concurrent threads executed, …)
Highly scalable solution – trade off accuracy in exchange for scalability – Nodes oversubscribed for simulation – Highly accurate simulations are extremely slow and less scalable
“The Extreme-scale Simulator permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing.”
Overview
Parallel discrete event simulation (PDES) to emulate the behaviour of various architecture models
Execution of real applications, algorithms or their models atop a simulated HPC environment for:
– Performance evaluation, including identification of resource contention and underutilization issues
– Investigation at extreme scale, beyond the capabilities of existing simulation efforts
© S. Boehm and C. Engelmann. xSim: The Extreme-Scale Simulator. HPCS 2011, Istanbul, Turkey, July 4-8, 2011.
Overview
Combines highly oversubscribed execution, a virtual MPI, and a time-accurate PDES (parallel discrete event simulation)
The PDES uses the native MPI and simulates virtual processors
The virtual processors expose a virtual MPI to applications
A multithreaded MPI implementation is needed (e.g. Open MPI with "--enable-mpi-thread-multiple")
Overview
The simulator is a library
Utilizes PMPI to intercept MPI calls and to hide the PDES
Easy to use (a minimal example follows below):
– Replace the MPI header
– Compile and link with the simulator library
– Run the MPI program
Support for C and Fortran MPI applications
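A minimal illustration of the header replacement, assuming the C header name xsim-c.h from the usage slide later in this talk; the rest is an ordinary MPI hello world written here as an illustrative sketch:

    /* hello_xsim.c - illustrative sketch; the only change from a plain MPI
       program is including the xSim header instead of mpi.h. */
    #include <stdio.h>
    #include "xsim-c.h"   /* replaces #include <mpi.h> */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from simulated rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }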
Overview
xSim is designed like a traditional performance tool
– Interposition library sitting between the MPI application and the MPI library
– Uses simulated wall clock time for measurement
– Performance data extracted based on the processor and network models
– MPI performance tool interface (PMPI)
Supports
– Simulated MPI point-to-point communication (essential calls)
– Simulated MPI data types, groups, communicators, collective communication (full)
– 81 simulated MPI calls each for C and Fortran
– ULFM MPI extensions
Comparison to Dimemas
Online simulator; a change in the model requires a rerun
– Batch simulations through scripts, configuration files, command line options
– Application model available
Oversubscribes nodes
– Larger simulations than the underlying system are possible
Support for multi-threaded MPI implementations
– Fault tolerance support through Open MPI ULFM
– A version with locks is available, albeit significantly slower
  • No two calls in the MPI communicator at the same time
  • Non-blocking calls become blocking calls
Simulation Models
Processor model
– Based on the actual execution time of the underlying hardware
– Scaled up for the simulated processor speed
– Heterogeneous cores with differing speeds
Support for various network architecture models
– Analyze existing hardware conditions / experiment with differing architectures
– Latency and bandwidth restrictions
– Hierarchical combinations
  • Network on chip
  • Network on node
– Sender/receiver contention simulation
– Full contention not supported for scalability reasons
Simulation Models
Application model
– Similar to MPI trace replays
– Same timing and communication behaviour
– Certain resources do not need to scale with the simulation
  • Memory usage
– Advances virtual time for the application between MPI calls
– Executes MPI calls without actually sending data (no need for buffers)
Operating system noise simulation
File system model
– Currently in development
– Read/write delay, access time, congestion, …
Network Models
[Diagrams of the supported network topology models: unidirectional ring, star, tree, mesh, torus, twisted torus, twisted torus with toroidal jump, twisted torus with toroidal degree]
General Usage of xSim
Add header files
– #include xsim-c.h
– #include xsim-f.h
Recompile and link with the library
– Library flag "-lxsim"
– Programming-language interface flags "-lxsim-c" or "-lxsim-f"
Run the application in the simulator (example below)
– mpirun -np <real process count> <application> <application args> -xsim-np <virtual process count> <xsim args>
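A sketch of the compile and run steps for the hello world sketch above, assuming the mpicc compiler wrapper; the -lxsim/-lxsim-c flags and the -xsim-np option come from this slide, while file names and process counts are illustrative:

    # compile and link against the simulator library
    mpicc hello_xsim.c -o hello_xsim -lxsim -lxsim-c

    # simulate 1,000,000 virtual MPI processes on 128 real processes
    mpirun -np 128 ./hello_xsim -xsim-np 1000000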
Examples – Hello World
Scaling hello world from 1,000 to 100,000,000 cores
– Native system: 12-core, 2-processor, 39-node Gigabit Ethernet cluster
– Simulated system: 100,000,000-processor Gigabit Ethernet system
xSim runs on up to 936 AMD Opteron cores and 2.5 TB RAM
– 468 or 936 cores needed for 100,000,000 simulated processes
– 100,000,000 × 8 kB = 800 GB in virtual MPI process stacks
[Plots: simulated time and xSim run time on 936 cores]
Examples – Basic Network Model
The model allows defining the network architecture, latency and bandwidth
Basic star network
The model can be set to 0µs and ∞ Gbps as a baseline
50µs and 1 Gbps roughly represented the native test environment
– 4 Intel dual-core 2.13 GHz nodes with 2 GB of memory each
– Ubuntu 8.04 64-bit Linux
– Open MPI 1.4.2 with multi-threading support
Example – Processor Model
The model allows scaling the relative speed to a different processor
Basic scaling model
The model can be set to 1.0x for baseline numbers
MPI hello world scales to 1M+ VPs on 4 nodes with 4 GB total stack (4 kB/VP)
Simulation (application)
– Constant execution time
– <1024 VPs: noisy clock
Simulator
– >256 VPs: output buffer issues
[Plots: simulated time and xSim run time for 0µs/∞Gbps/1.0x]
Example – Basic PI Monte Carlo Solver
Network model: – Star, 50μs and 1Gbps
Processor model – 1x (32kB stack/VP) – 0.5x (32kB stack/VP)
Simulation – Perfect scaling
Simulator – <= 8 VPs: 0% overhead on the 8 processor cores – >= 4096 VPs: comm. load dominates
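For context, a basic MPI Monte Carlo PI estimator of the kind used in this example might look like the following sketch (an illustrative implementation, not the actual benchmark code):

    /* pi_mc.c - illustrative MPI Monte Carlo PI estimator (not the benchmark itself);
       under xSim the mpi.h include would be replaced by xsim-c.h as shown earlier. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long local_samples = 1000000, local_hits = 0, total_hits = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(rank + 1);                        /* independent random stream per rank */
        for (long i = 0; i < local_samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                local_hits++;
        }

        /* one reduction per run: communication stays negligible at small VP counts */
        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~= %f\n", 4.0 * total_hits / (local_samples * size));

        MPI_Finalize();
        return 0;
    }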
[Plots: xSim run time and simulated time for 50µs/1Gbps with the 1.0x and 0.5x processor models]
Examples – NAS Parallel Benchmark
Scaling CG and EP class B problems – 1 to 128 simulated cores
Native system: 4 core, 2 processor, 16 node
Examples – NAS Parallel Benchmark
Scaling CG and EP class A problems – CG 1 to 4096 simulated cores – EP 1 to 16384 cores
Native system: 4 core, 2 processor, 16 node
[Plots: CG.B and EP.B simulated time and total run time]
Examples – MCMI Core Scaling
960-core system; 240 cores used for the simulation due to memory bandwidth restrictions
Linear behaviour up to a 2000x2000 matrix size; slight degradation for larger problem sizes
Examples – MCMI Problem Scaling
The simulator also gathers MPI statistics; linear increase in the number of exchanged messages
Examples – MCMI MPI Message Count Scaling
Fault Tolerance Properties
Fault tolerance is a property of a program, not of an API specification or an implementation. Within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Advanced Features – Resilience and Fault Tolerance
xSim fully supports error handling within simulated MPI – Default MPI error handlers – User-defined MPI error handlers – MPI_Abort()
Simulated abort terminates simulation and provides – Performance results – Source of abort – Time of abort
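As a reminder of what user-defined error handlers look like on the application side, here is an illustrative sketch using only standard MPI calls (nothing xSim-specific is assumed):

    /* Illustrative sketch: installing a user-defined MPI error handler,
       which xSim supports within the simulated MPI. */
    #include <stdio.h>
    #include <mpi.h>

    static void report_error(MPI_Comm *comm, int *errcode, ...)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        (void)comm;
        MPI_Error_string(*errcode, msg, &len);
        fprintf(stderr, "MPI error reported: %s\n", msg);
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler handler;
        MPI_Init(&argc, &argv);
        MPI_Comm_create_errhandler(report_error, &handler);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
        /* ... application code; errors on MPI_COMM_WORLD now call report_error ... */
        MPI_Finalize();
        return 0;
    }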
Advanced Features – Resilience and Fault Tolerance
Developing and debugging of FT applications
– Permits injection of MPI process failures
– Propagation/detection/notification of failures within the simulation
– Handles application-level checkpoint/restart
Observation of application behaviour and performance under failure is possible
Support for the User-Level Failure Mitigation (ULFM) extension
– Investigate and develop Algorithm-Based Fault Tolerance (ABFT) applications
– xSim is the first performance tool to support ULFM and ABFT
User-Level Failure Mitigation (ULFM)
Fault-tolerant MPI extension
Proposed by the MPI 3.0 Fault Tolerance Working Group
To be presented and voted upon in the MPI Forum in the coming months for integration into the upcoming MPI 3.1 standard
"Minimal set of changes necessary for applications and libraries to include fault tolerance techniques and to construct more forms of fault tolerance (transactions, strongly consistent collectives, etc.)"
User-Level Failure Mitigation (ULFM)
Three main concepts
– Simplicity
  • API easy to use and understand
– Flexibility
  • API allows varied fault-tolerant models to be built as external libraries
– Absence of deadlock
  • No MPI call (point-to-point or collective) can block indefinitely after a failure
  • Calls must either succeed or raise an MPI error
The default error handler needs to be changed to use ULFM
– On at least MPI_COMM_WORLD, from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN or a custom MPI error handler (a minimal sketch follows below)
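A minimal sketch of that error-handler change and of checking a returned error code (standard MPI calls only; nothing xSim-specific is assumed):

    /* Switch MPI_COMM_WORLD from MPI_ERRORS_ARE_FATAL (the default) to
       MPI_ERRORS_RETURN so failures are reported instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* under ULFM this may indicate MPI_ERR_PROC_FAILED or MPI_ERR_REVOKED */
        /* ... start recovery (see the handling calls on the next slides) ... */
    }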
User-Level Failure Mitigation (ULFM)
Exceptions raised – MPI_ERR_PROC_FAILED – MPI_ERR_PROC_FAILED_PENDING – MPI_ERR_REVOKED
Acknowledge – MPI_Comm_failure_ack – MPI_Comm_failure_get_acked
Handling – MPI_Comm_shrink – MPI_Comm_revoke – MPI_Comm_agree – MPI_Comm_iagree
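Taken together, a typical recovery pattern built from these calls might look like the sketch below. The call names follow this slide; Open MPI's ULFM prototype exposes them with an MPIX_ prefix (e.g. MPIX_Comm_shrink), and the surrounding variables are illustrative.

    /* Illustrative ULFM recovery sketch: detect a failure on a collective,
       revoke the communicator, and shrink it to the surviving ranks. */
    MPI_Comm comm = MPI_COMM_WORLD;
    int local = 1, global = 0;

    int rc = MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, comm);
    if (rc != MPI_SUCCESS) {
        int ec;
        MPI_Error_class(rc, &ec);
        if (ec == MPI_ERR_PROC_FAILED || ec == MPI_ERR_REVOKED) {
            MPI_Comm survivors;
            MPI_Comm_revoke(comm);              /* interrupt pending operations on all ranks */
            MPI_Comm_shrink(comm, &survivors);  /* new communicator without the failed ranks */
            comm = survivors;
            /* ... restore application state (e.g. from a checkpoint) and retry ... */
        }
    }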
ULFM in xSim
MPI_Comm_revoke()
– Linear broadcast of the failure notification through the simulated runtime
– No matching receive
– Releases any waited-on send or receive request with MPI_ERR_PROC_FAILED | MPI_ERR_PROC_FAILED_PENDING
MPI_Comm_shrink() – Two-phase commit protocol to establish the list of failed MPI ranks – Fault-tolerant linear reduction and broadcast operations
MPI_Comm_agree() – Agreement on a single value (logical AND operation by live members) – Fault-tolerant linear MPI_Allreduce() implementation
MPI_Comm_failure_ack() and MPI_Comm_failure_get_acked()
– Failure registry per rank and per communicator with low memory overhead (bit arrays)
Conclusions
The Extreme-scale Simulator (xSim) is a performance investigation toolkit
Uses oversubscription to model systems larger than the underlying hardware
Supports processor, network, application and noise models
– File system model under development
First performance toolkit to support MPI process failure injection, checkpoint/restart and ULFM
Forecasting behaviour on varying systems is possible
Saves time and resources via simulation