analysis of cluster failures on blue gene supercomputers
![Page 1: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/1.jpg)
Analysis of Cluster Failures on Blue Gene Supercomputers
Tom Hacker*, Fabian Romero+
Department of Computer & Information Technology*, Discovery Cyber Center+
Purdue University
Chris Carothers
Scientific Computation Research Center, Department of Computer Science
Rensselaer Polytechnic Institute
![Page 2: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/2.jpg)
Outline
• Update on NSF PetaApps CFD Project
– PetaApps Project Components
– Current Scaling Results
– Challenges for Fault Tolerance
• Analysis of Clustered Failures*
– EPFL: 8K Blue Gene/L
– RPI: 32K Blue Gene/L
– Analysis Approach
– Findings
• Summary
*To appear in upcoming issue of JPDC
![Page 3: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/3.jpg)
NSF PetaApps: Parallel Adaptive CFD
PetaApps Components
• CFD Solver
• Adaptivity
• Petascale Perf Sim
• Fault Recovery
• Demonstration Apps
– Cardiovascular Flow
– Flow Control
– Two-phase Flow
Ken Jansen (PD), Onkar Sahni, Chris Carothers, Mark S. Shephard
Scientific Computation Research Center, Rensselaer Polytechnic Institute
Acknowledgments:
• Partners: Simmetrix, Acusim, Kitware, IBM
• NSF: PetaApps, ITR, CTS; DOE: SciDAC-ITAPS, NERI; AFOSR
• Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
• Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI
![Page 4: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/4.jpg)
PHASTA Flow Solver Parallel Paradigm
• Time-accurate, stabilized FEM flow solver
• Two types of work: equation formation and implicit, iterative equation solution
• Equation formation
– O(40) peer-to-peer non-blocking comms
– Overlapping comms with comp
– Scales well on many machines
• Implicit, iterative equation solution
– Matrix assembled on processor ONLY
– Each Krylov vector is q = Ap (matrix-vector product)
– Same peer-to-peer comm of q PLUS orthogonalize against prior vectors
– REQUIRES NORMS => MPI_Allreduce
• This sets up a cycle of global comms separated by a modest amount of work
– Not currently able to overlap comms
– Even if work is balanced perfectly, OS jitter can imbalance it
– Imbalance WILL show up in MPI_Allreduce
– Scales well on machines with low noise (like Blue Gene)
[Figure: mesh partitioned across processors P1, P2, P3]
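The orthogonalization step is where the global reductions come from. The following is a minimal single-process sketch, not PHASTA code: each vector is stored as a list of per-rank chunks, and `allreduce_sum` is a hypothetical stand-in for `MPI_Allreduce` with a sum operation. It uses classical Gram-Schmidt, which is an assumption about the orthogonalization scheme.

```python
import math

def allreduce_sum(partials):
    """Stand-in for MPI_Allreduce with MPI_SUM: combine one value per rank."""
    return sum(partials)

def distributed_dot(x_parts, y_parts):
    # Each simulated rank forms a local partial dot product;
    # the sum over ranks is the global synchronization point.
    return allreduce_sum(
        sum(a * b for a, b in zip(x, y)) for x, y in zip(x_parts, y_parts)
    )

def orthogonalize(q_parts, basis):
    """Classical Gram-Schmidt of q against the prior vectors, then normalization.

    Vectors are lists of per-rank chunks. Returns (new_vector, global_reductions).
    """
    reductions = 0
    coeffs = []
    for v_parts in basis:
        coeffs.append(distributed_dot(q_parts, v_parts))
        reductions += 1
    for c, v_parts in zip(coeffs, basis):
        q_parts = [[qi - c * vi for qi, vi in zip(q, v)]
                   for q, v in zip(q_parts, v_parts)]
    norm = math.sqrt(distributed_dot(q_parts, q_parts))  # the norm needs a reduction too
    reductions += 1
    return [[qi / norm for qi in q] for q in q_parts], reductions

# Two simulated ranks, each owning two entries of every vector.
v1 = [[1.0, 0.0], [0.0, 0.0]]
q = [[3.0, 4.0], [0.0, 0.0]]
q_new, n_reductions = orthogonalize(q, [v1])
```

With k prior vectors this variant issues k + 1 reductions per new Krylov vector. The k dot products could be batched into one reduction of length k, but the norm still forces a global synchronization in every iteration, which is where any imbalance or jitter becomes visible.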
![Page 5: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/5.jpg)
Parallel Implicit Flow Solver – Incompressible Abdominal Aorta Aneurysm (AAA)

IBM BG/L, RPI-CCNI:

| Cores (avg. elems./core) | t (secs.) | scale factor |
|---|---|---|
| 512 (204800) | 2119.7 | 1 (base) |
| 1024 (102400) | 1052.4 | 1.01 |
| 2048 (51200) | 529.1 | 1.00 |
| 4096 (25600) | 267.0 | 0.99 |
| 8192 (12800) | 130.5 | 1.02 |
| 16384 (6400) | 64.5 | 1.03 |
| 32768 (3200) | 35.6 | 0.93 |

32K parts show modest degradation due to 15% node imbalance (with only about 600 mesh-nodes/part)
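The slide does not define the scale factor. A minimal sketch, assuming it is the usual strong-scaling efficiency relative to the 512-core base run, reproduces the listed values from the timings above. The function name `scale_factor` is illustrative.

```python
def scale_factor(base_cores, base_time, cores, time):
    """Strong-scaling efficiency relative to the base run (1.0 = ideal)."""
    return (base_time * base_cores) / (time * cores)

# (cores, seconds) pairs from the BG/L timings listed above
runs = [(512, 2119.7), (1024, 1052.4), (2048, 529.1), (4096, 267.0),
        (8192, 130.5), (16384, 64.5), (32768, 35.6)]

base_cores, base_time = runs[0]
factors = [round(scale_factor(base_cores, base_time, c, t), 2) for c, t in runs]
# factors -> [1.0, 1.01, 1.0, 0.99, 1.02, 1.03, 0.93]
```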
[Figure: region ratio (+) and node ratio (x) versus processor number (0 to 1000); vertical axis from 0.88 to 1.16]
Rgn./elem. ratio_i = rgns_i / avg_rgns
Node ratio_i = nodes_i / avg_nodes
(Min Zhou)
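The two ratios above can be computed directly from per-part counts. The sketch below uses made-up counts, not data from the talk, and assumes that a quoted imbalance such as "15%" means the most loaded part sits 15% above the average.

```python
def load_ratios(counts):
    """Per-part ratio count_i / average count, as in the definitions above."""
    avg = sum(counts) / len(counts)
    return [c / avg for c in counts]

def imbalance(counts):
    """Peak imbalance: how far the most loaded part sits above the average."""
    return max(load_ratios(counts)) - 1.0

# Hypothetical mesh-node counts for four parts (average is 600)
nodes_per_part = [600, 580, 690, 530]
node_ratios = load_ratios(nodes_per_part)
peak = imbalance(nodes_per_part)  # about 0.15
```

Because every part waits for the slowest one at each global reduction, the peak ratio rather than the average is the quantity that limits scaling.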
![Page 6: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/6.jpg)
Scaling of “AAA” 105M Case
Scaling loss due to OS jitter in MPI_Allreduce
![Page 7: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/7.jpg)
AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P

| # of cores | Rgn imb | Vtx imb | Time (s) | Scaling |
|---|---|---|---|---|
| 16k | 2.03% | 7.13% | 222.03 | 1 |
| 32k | 1.72% | 8.11% | 112.43 | 0.987 |
| 64k | 1.6% | 11.18% | 57.09 | 0.972 |
| 128k | 5.49% | 17.85% | 31.35 | 0.885 |
![Page 8: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/8.jpg)
ROSS: Massively Parallel Discrete-Event Simulation
• Local Control Mechanism: error detection and rollback
– (1) undo state changes
– (2) cancel “sent” events
• Global Control Mechanism: compute Global Virtual Time (GVT)
– collect versions of state / events and perform I/O operations that are < GVT
[Figure: two virtual-time diagrams over LP 1, LP 2 and LP 3, one for each control mechanism, with the GVT line marked on the second; legend: processed event, “straggler” event, unprocessed event, “committed” event]
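The global control mechanism can be illustrated with a small sketch. It is not ROSS code: it assumes the textbook definition of GVT as the minimum over every LP's virtual time and the timestamps of messages still in transit, and the function names are illustrative.

```python
def compute_gvt(lp_virtual_times, in_transit_timestamps):
    """GVT is a lower bound on any timestamp a rollback could still reach."""
    return min(list(lp_virtual_times) + list(in_transit_timestamps))

def fossil_collect(processed_timestamps, gvt):
    """Split processed events into committed (< GVT) and still tentative ones."""
    committed = [t for t in processed_timestamps if t < gvt]
    tentative = [t for t in processed_timestamps if t >= gvt]
    return committed, tentative

# Three LPs plus one message that has been sent but not yet received
gvt = compute_gvt([12.0, 9.5, 15.0], [8.0])
committed, tentative = fossil_collect([3.0, 7.9, 8.0, 11.0], gvt)
```

Events below GVT can never be rolled back, so their saved state can be reclaimed and their I/O performed. Everything at or above GVT must stay undoable in case a straggler arrives.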
![Page 9: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/9.jpg)
• PHOLD is a stress test.
• On BG/L @ CCNI
• 1 million LPs (note: DES of BG/L comms had 6M LPs).
• 10 events per LP
• Up to 100% probability that any event is scheduled to any other LP
• Other events are scheduled to self.
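A sequential sketch of the PHOLD workload makes the parameters above concrete. The exponential delays with mean 1.0 and the name `run_phold` are assumptions for illustration, and the sketch runs on one process with none of the optimistic synchronization that ROSS provides.

```python
import heapq
import random

def run_phold(num_lps, events_per_lp, remote_prob, end_time, seed=0):
    """Sequential PHOLD: each processed event schedules exactly one new event.

    Requires num_lps >= 2 so that a remote destination exists.
    """
    rng = random.Random(seed)
    pending = []  # min-heap of (timestamp, destination LP)
    for lp in range(num_lps):
        for _ in range(events_per_lp):
            heapq.heappush(pending, (rng.expovariate(1.0), lp))
    processed = remote = 0
    while pending[0][0] < end_time:
        now, lp = heapq.heappop(pending)
        processed += 1
        dest = lp  # default: schedule the follow-up event to self
        if rng.random() < remote_prob:
            # Remote case: pick any *other* LP uniformly at random.
            dest = (lp + rng.randrange(1, num_lps)) % num_lps
            remote += 1
        heapq.heappush(pending, (now + rng.expovariate(1.0), dest))
    return processed, remote, len(pending)

processed, remote, population = run_phold(
    num_lps=8, events_per_lp=10, remote_prob=0.10, end_time=5.0
)
```

The event population stays constant at `num_lps * events_per_lp`. The only knob that changes the communication load is `remote_prob`, which is why the benchmark works as a stress test.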
![Page 10: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/10.jpg)
• PHOLD on Blue Gene/P
• At 64K cores, only 16 LPs per core with 10 events per LP.
• At 128K cores, only 8 LPs per core: this is the maximum available parallelism, and performance drops off significantly.
• Peak performance of 12.26 billion events/sec for the 10% remote case.
![Page 11: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/11.jpg)
Challenges for Petascale Fault Tolerance
• Good news with caveats…
– Our applications are scaling well.
– But… scaling runs are relatively short (i.e., < 5 mins) and so don’t experience failures…
• One early example…
– PHASTA could only run for at most 5 hours using 32K nodes before Blue Gene/L lost at least one node and the whole program died…
• Systems are I/O bound… cannot checkpoint…
– BG/P “Intrepid” has 550+ TFlops of compute but only 55 to 60 GB/sec disk I/O bandwidth using 4 MB data blocks…
– At 10% of peak flops, can only do ~1 “double” of I/O per 8000 double-precision FLOPS.
• So we need to understand how systems fail in order to build efficient fault-tolerant supercomputer systems…
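The "1 double per 8000 FLOPS" figure follows from the numbers on the slide. A short check, assuming an 8-byte double and treating the rounded slide values as exact:

```python
def flops_per_io_double(peak_flops, sustained_fraction, io_bytes_per_sec,
                        bytes_per_double=8):
    """Floating-point operations performed per double moved to or from disk."""
    sustained_flops = peak_flops * sustained_fraction
    doubles_per_sec = io_bytes_per_sec / bytes_per_double
    return sustained_flops / doubles_per_sec

# 550 TFlops peak, 10% sustained, 55 to 60 GB/sec of disk bandwidth
low_bw = flops_per_io_double(550e12, 0.10, 55e9)   # about 8000
high_bw = flops_per_io_double(550e12, 0.10, 60e9)  # about 7333
```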
![Page 12: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/12.jpg)
Assessing Reliability on Petascale Systems
• Systems containing a large number of components experience a low mean time to failure
• Usual methods of failure analysis assume
– Independent failure events
– Exponentially distributed time between events
• These assumptions are necessary for homogeneous Markov modeling and Poisson processes
• Practical experience with systems of this scale shows that
– Failures are frequently cascading
– Analyzing time between events in reality is difficult
– It is difficult to put knowledge about reliability to practical use
• Better understanding of reliability from a practical perspective would be useful
– Increase reliability of large and long-running jobs
– Decrease checkpoint frequency
– Squeeze more efficiency from large-scale systems
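The first bullet can be quantified under exactly the assumptions the slide goes on to question (independent failures, exponentially distributed times). In that idealized case the failure rates add, so the system MTTF is the node MTTF divided by the node count. The 20-year node MTTF below is a hypothetical value chosen for illustration, not a measurement.

```python
import math

def system_mttf(node_mttf_hours, num_nodes):
    """MTTF of a system that fails when any one of its nodes fails."""
    return node_mttf_hours / num_nodes

def job_survival_probability(node_mttf_hours, num_nodes, job_hours):
    """Probability that no node fails during a job of the given length."""
    return math.exp(-job_hours * num_nodes / node_mttf_hours)

# Hypothetical per-node MTTF of 20 years; not a measured value.
node_mttf = 20 * 365 * 24                    # 175200 hours
mttf_32k = system_mttf(node_mttf, 32768)     # roughly 5.3 hours
p_5h = job_survival_probability(node_mttf, 32768, 5.0)
```

Even with very reliable individual nodes, a 32K-node job has a mean time to interruption of only a few hours under this model. Cascading and clustered failures break the independence assumption, which is why the model is only a first approximation.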
![Page 13: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/13.jpg)
Our approach
• Understand underlying statistics of failure on a large system in the field
• Use this understanding to attempt to predict node reliability
• Put this knowledge to work to improve job reliability
![Page 14: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/14.jpg)
Characteristics of Failure
• Failures in large-scale systems are rarely independent singular events
• A set of failures can arise from a single underlying cause
– Network problems
– Rack-level problems
– Software subsystem problems
• Failures are manifested as a cluster of failures
– Grouped spatially (e.g., in a rack)
– Grouped temporally
![Page 15: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/15.jpg)
Blue Gene
• We gathered RAS logs from two large Blue Gene systems (EPFL and RPI)
![Page 16: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/16.jpg)
Blue Gene RAS Data
• Events include a level of severity
– INFO (least severe), WARNING, SEVERE, ERROR, FATAL, FAILURE
– Location and time
• Mapped these events into a 3D space to understand what was happening over time on the system
– Node address -> X axis
– Time of event -> Y axis
– Severity level -> Z axis
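The 3D mapping can be sketched as follows. The record layout (location string, ISO timestamp, severity), the sample records, and the choice to number nodes in order of first appearance are all assumptions for illustration. The severity scale is taken to increase in the order listed on the slide.

```python
from datetime import datetime

# Severity levels in the order given on the slide, INFO being least severe.
SEVERITY_LEVELS = ["INFO", "WARNING", "SEVERE", "ERROR", "FATAL", "FAILURE"]

def to_points(events):
    """Map (location, ISO timestamp, severity) records to (x, y, z) tuples."""
    node_index = {}
    t0 = min(datetime.fromisoformat(ts) for _, ts, _ in events)
    points = []
    for location, ts, severity in events:
        x = node_index.setdefault(location, len(node_index))     # node address
        y = (datetime.fromisoformat(ts) - t0).total_seconds()    # time of event
        z = SEVERITY_LEVELS.index(severity)                      # severity level
        points.append((x, y, z))
    return points

# Hypothetical records; the location strings and timestamps are made up.
sample = [
    ("R00-M0-N0", "2008-01-01T00:00:00", "INFO"),
    ("R00-M0-N1", "2008-01-01T00:00:30", "FATAL"),
    ("R00-M0-N0", "2008-01-01T00:01:00", "WARNING"),
]
points = to_points(sample)
# points -> [(0, 0.0, 0), (1, 30.0, 4), (0, 60.0, 1)]
```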
![Page 17: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/17.jpg)
RPI Blue Gene Event Graph
![Page 18: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/18.jpg)
EPFL Blue Gene Event Graph
![Page 19: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/19.jpg)
Assessing Events
• Significant spatial and temporal clustering
– Used cluster analysis in R to reduce clustering
– Time between events transformed to Weibull
• Needed a model to predict node reliability
• In practice, nodes are either
– Healthy and operating normally
– Degraded and suspect
– Down
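To show what reducing temporal clustering means in practice, here is a deliberately simple stand-in for the cluster analysis: events separated by no more than a fixed gap are merged into one cluster, and the time between cluster starts becomes the quantity to model. This is not the authors' R procedure. The gap threshold and event times are hypothetical.

```python
def collapse_clusters(event_times, max_gap):
    """Group event times into clusters; a gap larger than max_gap starts a new one."""
    clusters = []
    for t in sorted(event_times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

def inter_cluster_times(clusters):
    """Time between the starts of consecutive clusters."""
    starts = [c[0] for c in clusters]
    return [b - a for a, b in zip(starts, starts[1:])]

# Hypothetical event times in seconds
times = [0, 5, 9, 400, 402, 1500]
clusters = collapse_clusters(times, max_gap=60)
gaps = inter_cluster_times(clusters)
# clusters -> [[0, 5, 9], [400, 402], [1500]], gaps -> [400, 1100]
```

Treating each burst as a single failure event removes the short inter-arrival times that a cascade produces, leaving intervals that are more plausibly described by a single distribution.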
![Page 20: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/20.jpg)
Node Reliability Model
![Page 21: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/21.jpg)
Two Markov Models
• Accurate, but slow
– Continuous time Markov model
– Cluster analysis takes over 10 hours
• Less accurate, but much faster
– Discrete time Markov model
– Time step adjustable
– Computed in minutes
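A minimal sketch of the discrete-time variant, using the three node states named earlier (healthy, degraded, down). The transition probabilities are hypothetical, and making "down" absorbing is a simplification for illustration. The talk's model would be fitted from RAS logs.

```python
STATES = ["healthy", "degraded", "down"]

# Hypothetical one-step transition probabilities (each row sums to 1).
# "down" is absorbing in this sketch.
P = [
    [0.98, 0.015, 0.005],
    [0.10, 0.85, 0.05],
    [0.0, 0.0, 1.0],
]

def step(dist, matrix):
    """One time step: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def reliability(start_state, steps, matrix=P):
    """Probability that the node has not reached 'down' after the given steps."""
    dist = [1.0 if s == start_state else 0.0 for s in STATES]
    for _ in range(steps):
        dist = step(dist, matrix)
    return 1.0 - dist[STATES.index("down")]

r_healthy = reliability("healthy", 10)
r_degraded = reliability("degraded", 10)
```

The cost is one small matrix-vector product per time step. That fits the slide's point that the discrete-time model is quick to compute, and the step length trades accuracy against speed.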
![Page 22: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/22.jpg)
Predicted Reliability RPI & EPFL
![Page 23: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/23.jpg)
Practical Application
• We can use this information to guide the scheduler
• Rank nodes by predicted reliability
– High reliability: least likely to fail
– Low reliability: most likely to fail
• Assign most reliable nodes to largest queues
• Assign least reliable nodes to smallest queues
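The ranking policy can be sketched in a few lines. The node names, reliability values, queue names and sizes are hypothetical, and `assign_nodes` is an illustrative helper rather than part of any real scheduler.

```python
def assign_nodes(node_reliability, queue_sizes):
    """Give the most reliable nodes to the largest queues.

    node_reliability: {node_id: predicted reliability in [0, 1]}
    queue_sizes: {queue_name: number of nodes the queue needs}
    """
    ranked = sorted(node_reliability, key=node_reliability.get, reverse=True)
    if sum(queue_sizes.values()) > len(ranked):
        raise ValueError("not enough nodes for the requested queues")
    assignment, cursor = {}, 0
    # Walk the queues from largest to smallest, handing out nodes in rank order.
    for queue, size in sorted(queue_sizes.items(), key=lambda kv: kv[1], reverse=True):
        assignment[queue] = ranked[cursor:cursor + size]
        cursor += size
    return assignment

# Hypothetical predictions and queue sizes
predicted = {"n0": 0.99, "n1": 0.70, "n2": 0.95, "n3": 0.80, "n4": 0.60, "n5": 0.90}
plan = assign_nodes(predicted, {"large": 3, "medium": 2, "small": 1})
# plan -> {"large": ["n0", "n2", "n5"], "medium": ["n3", "n1"], "small": ["n4"]}
```

A large job fails if any one of its nodes fails, so it benefits most from the reliable nodes. A small job is exposed to fewer nodes and can better tolerate the less reliable ones.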
![Page 24: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/24.jpg)
![Page 25: Analysis of Cluster Failures on Blue Gene Supercomputers](https://reader036.vdocument.in/reader036/viewer/2022081514/56813ff1550346895dab0683/html5/thumbnails/25.jpg)
Summary
• Our apps are scaling well on balanced hardware, but we have a strong need to understand how failures impact performance, especially at petascale levels
• Analysis of failure logs suggests failures follow a Weibull distribution.
• Semi-Markov models are able to assess the reliability of nodes on the systems
• Nodes that log a large number of RAS events (i.e., are noisy) are less reliable than nodes that log few events (i.e., < 3).
• Grouping less “noisy” nodes together creates a partition that is much less likely to fail, which significantly improves overall job completion rates and reduces the need for checkpointing.
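For reference, the Weibull distribution mentioned in the summary has a simple survival function and hazard rate. The shape and scale values below are hypothetical, not fitted values from the talk. A shape of 1 reduces the model to the exponential case assumed by the usual methods.

```python
import math

def weibull_survival(t, shape, scale):
    """P(time to next failure > t) for a Weibull(shape, scale) distribution."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at t > 0; decreasing in t when shape < 1."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical parameters: shape 0.7, scale 100 hours
s_100 = weibull_survival(100.0, shape=0.7, scale=100.0)   # exp(-1)
h_early = weibull_hazard(10.0, shape=0.7, scale=100.0)
h_late = weibull_hazard(100.0, shape=0.7, scale=100.0)
```

With a shape below 1 the hazard falls as time since the last failure grows. Under that assumption a node that has just logged a failure is more likely to fail again soon than one that has been quiet.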