www.bsc.es
xSim – The Extreme-Scale Simulator
Janko Strassburg
Severo Ochoa Seminar @ BSC, 28 Feb 2014
Motivation
Future exascale systems are predicted to have hundreds of thousands of nodes, with thousands of processors per node. Application scalability is limited by sequential parts, synchronizing communication and other bottlenecks.
Investigating the performance of parallel applications at scale is an important component of HPC hardware/software co-design
– Behaviour on future architectures
– Performance impact of architecture choices
Overview
Several existing simulators: JCAS, BigSim, Dimemas, MuPi
– Limitations (run time, number of concurrent threads executed, …)
Highly scalable solution – trade off accuracy in exchange for scalability – Nodes oversubscribed for simulation – Highly accurate simulations are extremely slow and less scalable
“The Extreme-scale Simulator permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing.”
Overview
Parallel discrete event simulation (PDES) to emulate the behaviour of various architecture models
Execution of real applications, algorithms or their models atop a simulated HPC environment for:
– Performance evaluation, including identification of resource contention and underutilization issues
– Investigation at extreme scale, beyond the capabilities of existing simulation efforts
© S. Boehm and C. Engelmann. xSim: The Extreme-Scale Simulator. HPCS 2011, Istanbul, Turkey, July 4-8, 2011.
Overview
Combines highly oversubscribed execution, a virtual MPI, and a time-accurate PDES (parallel discrete event simulation)
The PDES uses the native MPI and simulates virtual processors
The virtual processors expose a virtual MPI to applications
A multithreaded MPI implementation is needed (e.g. Open MPI with "--enable-mpi-thread-multiple")
Overview
The simulator is a library
Utilizes PMPI to intercept MPI calls and to hide the PDES
Easy to use (a minimal example follows below):
– Replace the MPI header
– Compile and link with the simulator library
– Run the MPI program
Support for C and Fortran MPI applications
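A minimal illustration of the header replacement, assuming the C header name xsim-c.h from the usage slide later in this talk; the rest is an ordinary MPI hello world written here as an illustrative sketch:

    /* hello_xsim.c - illustrative sketch; the only change from a plain MPI
       program is including the xSim header instead of mpi.h. */
    #include <stdio.h>
    #include "xsim-c.h"   /* replaces #include <mpi.h> */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from simulated rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }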
Overview
xSim is designed like a traditional performance tool
– Interposition library sitting between the MPI application and the MPI library
– Uses simulated wall clock time for measurement
– Performance data extracted based on the processor and network models
– MPI performance tool interface (PMPI)
Supports
– Simulated MPI point-to-point communication (essential calls)
– Simulated MPI data types, groups, communicators, collective communication (full)
– 81 simulated MPI calls each for C and Fortran
– ULFM MPI extensions
Comparison to Dimemas
Online simulator; a change in the model requires a rerun
– Batch simulations through scripts, configuration files, command line options
– Application model available
Oversubscribes nodes
– Larger simulations than the underlying system are possible
Support for multi-threaded MPI implementations
– Fault tolerance support through Open MPI ULFM
– A version with locks is available, albeit significantly slower
  • No two calls in the MPI communicator at the same time
  • Non-blocking calls become blocking calls
Simulation Models
Processor model
– Based on the actual execution time of the underlying hardware
– Scaled up for the simulated processor speed
– Heterogeneous cores with differing speeds
Support for various network architecture models
– Analyze existing hardware conditions / experiment with differing architectures
– Latency and bandwidth restrictions
– Hierarchical combinations
  • Network on chip
  • Network on node
– Sender/receiver contention simulation
– Full contention not supported for scalability reasons
Simulation Models
Application model
– Similar to MPI trace replays
– Same timing and communication behaviour
– Certain resources do not need to scale with the simulation
  • Memory usage
– Advances virtual time for the application between MPI calls
– Executes MPI calls without actually sending data (no need for buffers)
Operating system noise simulation
File system model
– Currently in development
– Read/write delay, access time, congestion, …
Network Models
[Diagrams of the supported network topology models: unidirectional ring, star, tree, mesh, torus, twisted torus, twisted torus with toroidal jump, twisted torus with toroidal degree]
General Usage of xSim
Add header files
– #include xsim-c.h
– #include xsim-f.h
Recompile and link with the library
– Library flag "-lxsim"
– Programming-language interface flags "-lxsim-c" or "-lxsim-f"
Run the application in the simulator (example below)
– mpirun -np <real process count> <application> <application args> -xsim-np <virtual process count> <xsim args>
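A sketch of the compile and run steps for the hello world sketch above, assuming the mpicc compiler wrapper; the -lxsim/-lxsim-c flags and the -xsim-np option come from this slide, while file names and process counts are illustrative:

    # compile and link against the simulator library
    mpicc hello_xsim.c -o hello_xsim -lxsim -lxsim-c

    # simulate 1,000,000 virtual MPI processes on 128 real processes
    mpirun -np 128 ./hello_xsim -xsim-np 1000000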
Examples – Hello World
Scaling hello world from 1,000 to 100,000,000 cores
– Native system: 12-core, 2-processor, 39-node Gigabit Ethernet cluster
– Simulated system: 100,000,000-processor Gigabit Ethernet system
xSim runs on up to 936 AMD Opteron cores and 2.5 TB RAM
– 468 or 936 cores needed for 100,000,000 simulated processes
– 100,000,000 × 8 kB = 800 GB in virtual MPI process stacks
[Plots: simulated time and xSim run time on 936 cores]
Examples – Basic Network Model
The model allows defining the network architecture, latency and bandwidth
Basic star network
The model can be set to 0µs and ∞ Gbps as a baseline
50µs and 1 Gbps roughly represented the native test environment
– 4 Intel dual-core 2.13 GHz nodes with 2 GB of memory each
– Ubuntu 8.04 64-bit Linux
– Open MPI 1.4.2 with multi-threading support
Example – Processor Model
The model allows scaling the relative speed to a different processor
Basic scaling model
The model can be set to 1.0x for baseline numbers
MPI hello world scales to 1M+ VPs on 4 nodes with 4 GB total stack (4 kB/VP)
Simulation (application)
– Constant execution time
– <1024 VPs: noisy clock
Simulator
– >256 VPs: output buffer issues
[Plots: simulated time and xSim run time for 0µs/∞Gbps/1.0x]
Example – Basic PI Monte Carlo Solver
Network model: – Star, 50μs and 1Gbps
Processor model – 1x (32kB stack/VP) – 0.5x (32kB stack/VP)
Simulation – Perfect scaling
Simulator – <= 8 VPs: 0% overhead on the 8 processor cores – >= 4096 VPs: comm. load dominates
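For context, a basic MPI Monte Carlo PI estimator of the kind used in this example might look like the following sketch (an illustrative implementation, not the actual benchmark code):

    /* pi_mc.c - illustrative MPI Monte Carlo PI estimator (not the benchmark itself);
       under xSim the mpi.h include would be replaced by xsim-c.h as shown earlier. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long local_samples = 1000000, local_hits = 0, total_hits = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(rank + 1);                        /* independent random stream per rank */
        for (long i = 0; i < local_samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                local_hits++;
        }

        /* one reduction per run: communication stays negligible at small VP counts */
        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~= %f\n", 4.0 * total_hits / (local_samples * size));

        MPI_Finalize();
        return 0;
    }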
[Plots: xSim run time and simulated time for 50µs/1Gbps with the 1.0x and 0.5x processor models]
Examples – NAS Parallel Benchmark
Scaling CG and EP class B problems – 1 to 128 simulated cores
Native system: 4 core, 2 processor, 16 node
Examples – NAS Parallel Benchmark
Scaling CG and EP class A problems – CG 1 to 4096 simulated cores – EP 1 to 16384 cores
Native system: 4 core, 2 processor, 16 node
[Plots: CG.B and EP.B simulated time and total run time]
Examples – MCMI Core Scaling
960-core system; 240 cores used for the simulation due to memory bandwidth restrictions
Linear behaviour up to a 2000x2000 matrix size; slight degradation for larger problem sizes
Examples – MCMI Problem Scaling
The simulator also gathers MPI statistics; linear increase in the number of exchanged messages
Examples – MCMI MPI Message Count Scaling
Fault Tolerance Properties
Fault tolerance is a property of a program, not of an API specification or an implementation. Within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Advanced Features – Resilience and Fault Tolerance
xSim fully supports error handling within simulated MPI – Default MPI error handlers – User-defined MPI error handlers – MPI_Abort()
Simulated abort terminates simulation and provides – Performance results – Source of abort – Time of abort
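As a reminder of what user-defined error handlers look like on the application side, here is an illustrative sketch using only standard MPI calls (nothing xSim-specific is assumed):

    /* Illustrative sketch: installing a user-defined MPI error handler,
       which xSim supports within the simulated MPI. */
    #include <stdio.h>
    #include <mpi.h>

    static void report_error(MPI_Comm *comm, int *errcode, ...)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        (void)comm;
        MPI_Error_string(*errcode, msg, &len);
        fprintf(stderr, "MPI error reported: %s\n", msg);
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler handler;
        MPI_Init(&argc, &argv);
        MPI_Comm_create_errhandler(report_error, &handler);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
        /* ... application code; errors on MPI_COMM_WORLD now call report_error ... */
        MPI_Finalize();
        return 0;
    }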
Advanced Features – Resilience and Fault Tolerance
Developing and debugging of FT applications
– Permits injection of MPI process failures
– Propagation/detection/notification of failures within the simulation
– Handles application-level checkpoint/restart
Observation of application behaviour and performance under failure is possible
Support for the User-Level Failure Mitigation (ULFM) extension
– Investigate and develop Algorithm-Based Fault Tolerance (ABFT) applications
– xSim is the first performance tool to support ULFM and ABFT
User-Level Failure Mitigation (ULFM)
Fault-tolerant MPI extension
Proposed by the MPI 3.0 Fault Tolerance Working Group
To be presented and voted upon in the MPI Forum in the coming months for integration into the upcoming MPI 3.1 standard
"Minimal set of changes necessary for applications and libraries to include fault tolerance techniques and to construct more forms of fault tolerance (transactions, strongly consistent collectives, etc.)"
User-Level Failure Mitigation (ULFM)
Three main concepts
– Simplicity
  • API easy to use and understand
– Flexibility
  • API allows varied fault-tolerant models to be built as external libraries
– Absence of deadlock
  • No MPI call (point-to-point or collective) can block indefinitely after a failure
  • Calls must either succeed or raise an MPI error
The default error handler needs to be changed to use ULFM
– On at least MPI_COMM_WORLD, from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN or a custom MPI error handler (a minimal sketch follows below)
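A minimal sketch of that error-handler change and of checking a returned error code (standard MPI calls only; nothing xSim-specific is assumed):

    /* Switch MPI_COMM_WORLD from MPI_ERRORS_ARE_FATAL (the default) to
       MPI_ERRORS_RETURN so failures are reported instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* under ULFM this may indicate MPI_ERR_PROC_FAILED or MPI_ERR_REVOKED */
        /* ... start recovery (see the handling calls on the next slides) ... */
    }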
User-Level Failure Mitigation (ULFM)
Exceptions raised – MPI_ERR_PROC_FAILED – MPI_ERR_PROC_FAILED_PENDING – MPI_ERR_REVOKED
Acknowledge – MPI_Comm_failure_ack – MPI_Comm_failure_get_acked
Handling – MPI_Comm_shrink – MPI_Comm_revoke – MPI_Comm_agree – MPI_Comm_iagree
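Taken together, a typical recovery pattern built from these calls might look like the sketch below. The call names follow this slide; Open MPI's ULFM prototype exposes them with an MPIX_ prefix (e.g. MPIX_Comm_shrink), and the surrounding variables are illustrative.

    /* Illustrative ULFM recovery sketch: detect a failure on a collective,
       revoke the communicator, and shrink it to the surviving ranks. */
    MPI_Comm comm = MPI_COMM_WORLD;
    int local = 1, global = 0;

    int rc = MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, comm);
    if (rc != MPI_SUCCESS) {
        int ec;
        MPI_Error_class(rc, &ec);
        if (ec == MPI_ERR_PROC_FAILED || ec == MPI_ERR_REVOKED) {
            MPI_Comm survivors;
            MPI_Comm_revoke(comm);              /* interrupt pending operations on all ranks */
            MPI_Comm_shrink(comm, &survivors);  /* new communicator without the failed ranks */
            comm = survivors;
            /* ... restore application state (e.g. from a checkpoint) and retry ... */
        }
    }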
ULFM in xSim
MPI_Comm_revoke()
– Linear broadcast of the failure notification through the simulated runtime
– No matching receive
– Releases any waited-on send or receive request with MPI_ERR_PROC_FAILED | MPI_ERR_PROC_FAILED_PENDING
MPI_Comm_shrink() – Two-phase commit protocol to establish the list of failed MPI ranks – Fault-tolerant linear reduction and broadcast operations
MPI_Comm_agree() – Agreement on a single value (logical AND operation by live members) – Fault-tolerant linear MPI_Allreduce() implementation
MPI_Comm_failure_ack() and MPI_Comm_failure_get_acked()
– Failure registry per rank and per communicator with low memory overhead (bit arrays)
Conclusions
The Extreme-scale Simulator (xSim) is a performance investigation toolkit
Uses oversubscription to model systems larger than the underlying hardware
Supports processor, network, application and noise models
– File system model under development
First performance toolkit to support MPI process failure injection, checkpoint/restart and ULFM
Forecasting behaviour on varying systems is possible
Saves time and resources via simulation