LO|FA|MO: Fault Detection and Systemic Awareness for the QUonG computing system
Roberto Ammendola, on behalf of the APE Group
Istituto Nazionale di Fisica Nucleare, Sezione Roma Tor Vergata
IEEE Symposium on Reliable Distributed Systems, Nara, Japan – October 6-9, 2014
EURETILE
R. Ammendola (INFN RM2) LO|FA|MO SRDS 2014 1 / 19
APE: Our research group
Array Processor Experiment (APE) is a 25-year-old project at INFN, developing fully custom and hybrid parallel computing machines.
Research advances in floating point engines, interconnection networks, system integration, . . .
(timeline of APE machines, 1988–2006)
Fault tolerance problem
Faults and critical events of various types affect all complex computing systems.
Embedded systems can be employed in fields where requirements of reliability and time predictability are exceedingly strict.
In High Performance Computing, when scaling towards exa-scale, the MTBF can be dramatically small even when techniques are applied to increase component resilience.
A systemic fault tolerance technique is mandatory!
Fault and Critical Event Management
Divide et impera
Split the problem of fault management into smaller/simpler sub-problems
Hierarchical Fault Tolerance
Faults are detected locally and then the information is spread over the different nodes and levels
Supervisor nodes are distributed
Fault reaction follows the inverse path
(diagram: Hierarchical Fault Tolerance. Fault Awareness path: Local Fault Detection → Sub-system Fault Awareness → Global Fault Awareness. Fault Reactivity path: Systemic Response → Sub-system Response → Leaf Tile re-adjustment)
Fault awareness
System level Fault Awareness
The system must be aware of faults to be able to react
Requires detection mechanisms at local level
Requires a communication network able to send diagnostic messages (possibly without increasing application data transmission latency)
Fault reaction
System reaction (automatic or non-automatic) to faults
Typically 3 classes of methods:
Avoid the failures (e.g. prediction and migration)
Avoid the effect of failures (e.g. system replication)
Repair the effect of failures (e.g. checkpoint and restart)
They all require fault awareness!
Local Fault Monitor: key principles
The Local Fault Monitor (LO|FA|MO) is our approach to fault awareness for distributed systems.
Based on:
A LO|FA|MO-enabled network interface implementing a 3D toroidal network between computing nodes (i.e. our Distributed Network Processor - DNP)
A Mutual Watchdog mechanism between host and DNP on each node
A Service Network for diagnostic messages
This approach guarantees no-single-point-of-failure fault awareness!
Mutual Watchdog mechanism
Two peers monitoring each other by reading/writing the Host/DNP Watchdog Register:
(diagram: on the host side, the Host Fault Manager reads/writes diagnostic info over the bus; on the DNP side, the DNP Fault Manager holds the HWR, DWR, Configuration Registers and Remote Fault Descriptor Registers in memory, and interfaces with sensors (T, V, I), the Torus Links and the Service Network)
DNP fault manager (DFM)
hardware component inside the DNP
collects information about the DNP status
encodes the DNP status in the DWR
reads the host status in the HWR

Host Fault Manager (HFM)
software component running on the host
collects information about the host status
encodes the host status in the HWR
reads the DNP status in the DWR
Both are able to check whether the other is still alive (periodic status update, with Twrite < Tread)
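The Twrite < Tread relation can be exercised with a toy simulation. The sketch below is illustrative only, not the actual DFM/HFM implementation; all names (T_WRITE, detect_stopping_fault, the counter register) are invented for the example.

```python
# Toy model of the mutual watchdog: one peer increments its watchdog
# register every T_WRITE ms; the other samples it every T_READ ms and
# flags a stopping fault when the value has not advanced between two
# consecutive reads. T_WRITE < T_READ guarantees a healthy peer always
# advances the register between reads.
T_WRITE = 100   # ms, status-update period of the writer
T_READ = 500    # ms, sampling period of the reader

def detect_stopping_fault(death_time, horizon=5000):
    """Simulate one writer/reader pair; the writer dies at death_time (ms).
    Returns the time (ms) at which the reader declares a fault, or None."""
    register = 0
    last_seen = None
    for t in range(0, horizon + 1):
        if t % T_WRITE == 0 and t < death_time:
            register += 1                    # writer refreshes its register
        if t % T_READ == 0 and t > 0:
            if last_seen is not None and register == last_seen:
                return t                     # no update since last read: fault
            last_seen = register
    return None

# a writer dying at t = 1050 ms is detected within two read periods
t_detect = detect_stopping_fault(1050)
```

With these periods the detection latency never exceeds two read periods, which is consistent with the sub-second awareness times reported later in the talk.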
LO|FA|MO fault detection
What kinds of faults (and critical events) is LO|FA|MO able to detect?
Host faults
The host is down (detected by means of the watchdog mechanism)
Service Network is down or malfunctioning
Temperature, memory, peripherals... anything that can be checked by the HFM by querying sensors
DNP faults
DNP is down (detected by means of the watchdog mechanism)
DNP links are faulty (broken/unplugged, too many transmission errors, ...)
Temperature, voltage, current over thresholds (interface with sensors)
Other DNP logic faults
LO|FA|MO Supervisor nodes
A Supervisor node:
is like other nodes in the mesh but runs a special instance of the HFM
collects diagnostic messages coming from the Service Network
can be completed with heuristics for decision making (trigger for fault reaction)
Multiple Supervisors ensure no-single-point-of-failure fault awareness:
divide the mesh into regions, each with a Supervisor
properly choose the Supervisor node in each region (on the border!)
(diagram: mesh split into regions, each with a Supervisor node S on the region border)
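The region/Supervisor layout can be sketched as follows. This is a minimal illustration assuming square regions and a corner placement policy; the slides only require the Supervisor to sit on the region border, so the corner choice is an invented example policy.

```python
def assign_supervisors(mesh_size, region_size):
    """Partition a mesh_size x mesh_size mesh into square regions of
    region_size x region_size and pick, for each region, a node on the
    region border as its Supervisor. The corner choice is illustrative:
    any border node has first neighbours in adjacent regions, which is
    what lets a Supervisor stay reachable if its own region degrades."""
    supervisors = {}
    for rx in range(0, mesh_size, region_size):
        for ry in range(0, mesh_size, region_size):
            region_id = (rx // region_size, ry // region_size)
            # the region's lower-left corner lies on its border
            supervisors[region_id] = (rx, ry)
    return supervisors

# an 8x8 mesh split into four 4x4 regions yields four Supervisors
sup = assign_supervisors(8, 4)
```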
Putting all together: fault awareness with LO|FA|MO
Local Fault Detection
sensors and self-diagnostic logic
watchdog mechanism (one of the peers misses the status update: stopping fault)

Sub-system Fault Awareness
the DWR contains the DNP status → host awareness
the HWR contains the host status → DNP awareness

Systemic Fault Awareness
DFM sends diagnostic messages via the 3D net towards the first neighbouring nodes (FN nodes)
HFM sends diagnostic messages via the Service Network to a Supervisor node
DFMs of the FN nodes receive diagnostic messages and encode faults in the DWR
HFMs of the FN nodes become aware of faults occurring on the node host
The mesh of nodes is monitored by one or multiple Supervisor nodes (global awareness)
LO|FA|MO Systemic Fault Awareness
Double path for systemic fault awareness!
(diagram: when the host is down, the DNP reports "Host is down!" via the 3D net; when the DNP is broken, the host reports "DNP is broken!" via the Service net; both messages reach the Supervisor S)
APENet+ Custom interconnect
Aimed for:
low latency
high bandwidth
CPU offloading
(board photo: SO-DIMM DDR3, mini-USB, six torus links X+/X-/Y+/Y-/Z+/Z-, Gbit Ethernet, external power, programmable device, PCI-e connector, QSFP+ connectors)
Altera Stratix IV based NIC:
PCIe Gen2 x8
6 bidirectional Off-Board links, 40 Gb QSFP+ interconnect fabric
RDMA communication paradigm
GPUDirect capable (NVidia peer-to-peer protocol)
Re-using IP cores also in High Energy Physics and Particle Physics data acquisition experiments.
QUonG cluster
QUantum chromodynamics on Gpu is a comprehensive initiative aimed at providing a hybrid, GPU-accelerated x86_64 cluster with a 3D toroidal mesh topology, able to scale up to 10^4 – 10^5 nodes
Current Status
16 nodes equipped with the APEnet+ board

QUonG Hybrid Computing Node
double Intel Xeon E5620
48 GB System Memory
2x S2075 NVIDIA Fermi GPUs
1 APEnet+ board
40 Gb/s InfiniBand HCA

Software Environment
CentOS 6.4
NVIDIA CUDA 6.5 driver
OpenMPI and MVAPICH2 MPI
LO|FA|MO DFM implementation

DNP Fault Manager
component in the APEnet+ FPGA design (uses very little of the FPGA resources)
interacts with APEnet+ on-board sensors
configurable by the HFM (timers, masks)
implements the Mutual Watchdog (Watchdog registers are in the FPGA and memory-mapped on PCIe)
able to send diagnostic messages to first neighbours; the messages are hidden in the physical protocol (Link Fault Manager)
(block diagram: APEnet+ FPGA design. The DNP comprises a TX/RX block, a 32-bit microcontroller, a collective communication block, a memory controller and a GPU I/O accelerator on the Network Interface; a Router with an NxN ports switch, routing logic, arbiter and link controllers; and an Off-board Interface where each torus link (e.g. X+) hosts a DNP Fault Manager, LiFaMa TX/RX FIFOs for the LiFaMa message flows alongside the DNP data flow, transmission control logic and the physical layer. A PCIe x8 Gen2 core, on-board memory and a 1GbE port complete the design.)
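As a rough illustration of how a status word like the DWR might be packed, the sketch below invents a bit layout (per-link fault bits plus temperature/voltage alarms for the events the talk lists). The real APEnet+ register map is not given in the slides; every field name and bit position here is an assumption.

```python
# Hypothetical DWR layout, for illustration only: six link-fault bits
# (one per torus link X+/X-/Y+/Y-/Z+/Z-) and two sensor-alarm bits.
LINK_FAULT_SHIFT, LINK_FAULT_BITS = 0, 6
TEMP_ALARM_BIT = 6
VOLT_ALARM_BIT = 7

def encode_dwr(link_faults, temp_alarm, volt_alarm):
    """Pack DNP status into one 32-bit word, as the DFM would before
    exposing it through the memory-mapped register."""
    word = 0
    for i, faulty in enumerate(link_faults):   # six torus links
        if faulty:
            word |= 1 << (LINK_FAULT_SHIFT + i)
    if temp_alarm:
        word |= 1 << TEMP_ALARM_BIT
    if volt_alarm:
        word |= 1 << VOLT_ALARM_BIT
    return word

def decode_dwr(word):
    """Unpack the same fields, as the HFM would after reading the DWR."""
    links = [bool(word & (1 << i)) for i in range(LINK_FAULT_BITS)]
    return {
        "link_faults": links,
        "temp_alarm": bool(word & (1 << TEMP_ALARM_BIT)),
        "volt_alarm": bool(word & (1 << VOLT_ALARM_BIT)),
    }

# e.g. X- link faulty plus a temperature alarm
w = encode_dwr([False, True, False, False, False, False],
               temp_alarm=True, volt_alarm=False)
```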
LO|FA|MO HFM implementation
Host Fault Manager
implemented as a GNU/Linux multi-threaded daemon running on QUonG nodes
configurable (nodes, supervisors, watchdog periods, ...) via GUI
implements the Mutual Watchdog (access to the watchdog register on the APEnet+ device)
queries host sensors and diagnostic tools
sends diagnostic messages via the Service Network (Ethernet)
special thread on Supervisor nodes to gather diagnostic messages
(diagram: HFM thread structure. Each node runs DNP WD, Host WD, SNET Monitor and Fault Notifier threads; the Supervisor node additionally runs an SNET Fault Logger collecting notifications from nodes A, B and C over the Service Network.)
Results on QUonG
Time to fault awareness measured on the Supervisor node:
T2 − T1 = 0.9 s at Tread = 500 ms, Twrite = 100 ms
T1 = time of fault injection
T2 = time of Supervisor awareness
Low impact on system performance → greatscaling
HFM resource usage
Virtual Mem: 120 MB | Shared Mem: 776 B | Resident Mem: 1 KB | CPU usage: < 0.1%
PCIe bus occupancy1 APEnet+ register access takes 6 µs
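As a back-of-the-envelope consistency check (this is not the formula used in the talk): if the reader flags a fault when the watchdog register has not advanced between two consecutive reads, the worst-case local detection latency is about two read periods, so the measured 0.9 s at Tread = 500 ms sits within the resulting 1 s bound plus message propagation.

```python
def worst_case_detection_ms(t_read_ms):
    """Upper bound on local stopping-fault detection latency for a
    'value unchanged between two consecutive reads' check: the fault
    can occur just after a read that still sees fresh data, so up to
    two full read periods pass before the flagging read. A toy bound,
    not the model from the paper."""
    return 2 * t_read_ms

# measured 0.9 s to Supervisor awareness at T_read = 500 ms:
# within the 1.0 s local bound plus propagation
bound = worst_case_detection_ms(500)
```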
(plots: "LO|FA|MO host fault, time to awareness": Fault Awareness Time (s) vs Watchdog Read Period (ms), 0–600 ms; "LO|FA|MO host fault, time to awareness scaling": Time to Supervisor Awareness (TR = 500 ms, TW = 100 ms) vs number of QUonG nodes, 2–32)
Conclusions and future work
We designed the Local Fault Monitor (LO|FA|MO) as a scalable approach to systemic fault awareness for distributed systems.
The LO|FA|MO implementation on the QUonG cluster relies on the APEnet+ interconnect card and is lightweight and scalable:
low resource occupancy (HW and SW)
diagnostic messages on the Service Network or embedded in the 3D network physical protocol
time to global fault awareness depends on and is dominated by the watchdog period

New developments:
pairing of LO|FA|MO with a fault reaction system (automatic task migration)
introduction of GPU monitoring
. . .
Thank you! Questions or comments?