LO|FA|MO: Fault Detection and Systemic Awareness for the QUonG computing system
Roberto Ammendola, on behalf of the APE Group
Istituto Nazionale di Fisica Nucleare, Sezione Roma Tor Vergata
IEEE Symposium on Reliable Distributed Systems, Nara, Japan – October 6-9, 2014
EURETILE
R. Ammendola (INFN RM2) LO|FA|MO SRDS 2014 1 / 19
APE: Our research group
Array Processor Experiment (APE) is a 25-year-old project at INFN, developing fully custom and hybrid parallel computing machines.
Research advances in floating point engines, interconnection networks, system integration, . . .
(timeline of APE machines, 1988–2006)
Fault tolerance problem
Faults and critical events of various types affect all complex computing systems.
Embedded systems can be employed in fields where requirements of reliability and time predictability are exceedingly strict.
In High Performance Computing, when scaling towards exa-scale, the MTBF can be dramatically small even when techniques are applied to increase component resilience.
A systemic fault tolerance technique is mandatory!
Fault and Critical Event Management
Divide et impera
Split the problem of fault management into smaller/simpler sub-problems
Hierarchical Fault Tolerance
Faults are detected locally and then the information is spread over the different nodes and levels
Supervisor nodes are distributed
Fault reaction follows the inverse path
(diagram: Hierarchical Fault Tolerance. Fault Awareness path: Local Fault Detection → Sub-system Fault Awareness → Global Fault Awareness. Fault Reactivity path: Systemic Response → Sub-system Response → Leaf Tile re-adjustment)
Fault awareness
System level Fault Awareness
The system must be aware of faults to be able to react
Requires detection mechanisms at local level
Requires a communication network able to send diagnostic messages (possibly without increasing application data transmission latency)
Fault reaction
System reaction (automatic or non-automatic) to faults
Typically 3 classes of methods:
Avoid the failures (e.g. prediction and migration)
Avoid the effect of failures (e.g. system replication)
Repair the effect of failures (e.g. checkpoint and restart)
They all require fault awareness!
Local Fault Monitor: key principles
The Local Fault Monitor (LO|FA|MO) is our approach to fault awareness for distributed systems.
Based on:
A LO|FA|MO-enabled network interface implementing a 3D toroidal network between computing nodes (i.e. our Distributed Network Processor - DNP)
A Mutual Watchdog mechanism between host and DNP on each node
A Service Network for diagnostic messages
This approach guarantees no-single-point-of-failure fault awareness!
Mutual Watchdog mechanism
Two peers monitoring each other by reading/writing the Host/DNP Watchdog Register:
(diagram: on the host side, the Host Fault Manager reads/writes diagnostic info over the bus; on the DNP side, the DNP Fault Manager holds the HWR, DWR, Configuration Registers and Remote Fault Descriptor Registers in memory, and interfaces with sensors (T, V, I), the Torus Links and the Service Network)
DNP fault manager (DFM)
hardware component inside the DNP
collects information about the DNP status
encodes the DNP status in the DWR
reads the host status in the HWR

Host Fault Manager (HFM)
software component running on the host
collects information about the host status
encodes the host status in the HWR
reads the DNP status in the DWR
Both are able to check whether the other is still alive (periodic status update, with Twrite < Tread)
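The Twrite < Tread relation can be exercised with a toy simulation. The sketch below is illustrative only, not the actual DFM/HFM implementation; all names (T_WRITE, detect_stopping_fault, the counter register) are invented for the example.

```python
# Toy model of the mutual watchdog: one peer increments its watchdog
# register every T_WRITE ms; the other samples it every T_READ ms and
# flags a stopping fault when the value has not advanced between two
# consecutive reads. T_WRITE < T_READ guarantees a healthy peer always
# advances the register between reads.
T_WRITE = 100   # ms, status-update period of the writer
T_READ = 500    # ms, sampling period of the reader

def detect_stopping_fault(death_time, horizon=5000):
    """Simulate one writer/reader pair; the writer dies at death_time (ms).
    Returns the time (ms) at which the reader declares a fault, or None."""
    register = 0
    last_seen = None
    for t in range(0, horizon + 1):
        if t % T_WRITE == 0 and t < death_time:
            register += 1                    # writer refreshes its register
        if t % T_READ == 0 and t > 0:
            if last_seen is not None and register == last_seen:
                return t                     # no update since last read: fault
            last_seen = register
    return None

# a writer dying at t = 1050 ms is detected within two read periods
t_detect = detect_stopping_fault(1050)
```

With these periods the detection latency never exceeds two read periods, which is consistent with the sub-second awareness times reported later in the talk.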
LO|FA|MO fault detection
What kinds of faults (and critical events) is LO|FA|MO able to detect?
Host faults
The host is down (detected by means of the watchdog mechanism)
Service Network is down or malfunctioning
Temperature, memory, peripherals... anything that can be checked by the HFM by querying sensors
DNP faults
DNP is down (detected by means of the watchdog mechanism)
DNP links are faulty (broken/unplugged, too many transmission errors, ...)
Temperature, voltage, current over thresholds (interface with sensors)
Other DNP logic faults
LO|FA|MO Supervisor nodes
A Supervisor node:
is like other nodes in the mesh but runs a special instance of the HFM
collects diagnostic messages coming from the Service Network
can be completed with heuristics for decision making (trigger for fault reaction)
Multiple Supervisors ensure no-single-point-of-failure fault awareness:
divide the mesh into regions, each with a Supervisor
properly choose the Supervisor node in each region (on the border!)
(diagram: mesh split into regions, each with a Supervisor node S on the region border)
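The region/Supervisor layout can be sketched as follows. This is a minimal illustration assuming square regions and a corner placement policy; the slides only require the Supervisor to sit on the region border, so the corner choice is an invented example policy.

```python
def assign_supervisors(mesh_size, region_size):
    """Partition a mesh_size x mesh_size mesh into square regions of
    region_size x region_size and pick, for each region, a node on the
    region border as its Supervisor. The corner choice is illustrative:
    any border node has first neighbours in adjacent regions, which is
    what lets a Supervisor stay reachable if its own region degrades."""
    supervisors = {}
    for rx in range(0, mesh_size, region_size):
        for ry in range(0, mesh_size, region_size):
            region_id = (rx // region_size, ry // region_size)
            # the region's lower-left corner lies on its border
            supervisors[region_id] = (rx, ry)
    return supervisors

# an 8x8 mesh split into four 4x4 regions yields four Supervisors
sup = assign_supervisors(8, 4)
```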
Putting all together: fault awareness with LO|FA|MO
Local Fault Detection
sensors and self-diagnostic logic
watchdog mechanism (one of the peers misses the status update: stopping fault)

Sub-system Fault Awareness
the DWR contains the DNP status → host awareness
the HWR contains the host status → DNP awareness

Systemic Fault Awareness
DFM sends diagnostic messages via the 3D net towards the first neighbouring nodes (FN nodes)
HFM sends diagnostic messages via the Service Network to a Supervisor node
DFMs of the FN nodes receive diagnostic messages and encode faults in the DWR
HFMs of the FN nodes become aware of faults occurring on the node host
The mesh of nodes is monitored by one or multiple Supervisor nodes (global awareness)
LO|FA|MO Systemic Fault Awareness
Double path for systemic fault awareness!
(diagram: when the host is down, the DNP reports "Host is down!" via the 3D net; when the DNP is broken, the host reports "DNP is broken!" via the Service net; both messages reach the Supervisor S)
APENet+ Custom interconnect
Aimed for:
low latency
high bandwidth
CPU offloading
(board photo: SO-DIMM DDR3, mini-USB, six torus links X+/X-/Y+/Y-/Z+/Z-, Gbit Ethernet, external power, programmable device, PCI-e connector, QSFP+ connectors)
Altera Stratix IV based NIC:
PCIe Gen2 x8
6 bidirectional Off-Board links, 40 Gb QSFP+ interconnect fabric
RDMA communication paradigm
GPUDirect capable (NVidia peer-to-peer protocol)
Re-using IP cores also in High Energy Physics and Particle Physics data acquisition experiments.
QUonG cluster
QUantum chromodynamics on Gpu is a comprehensive initiative aimed at providing a hybrid, GPU-accelerated x86_64 cluster with a 3D toroidal mesh topology, able to scale up to 10^4 – 10^5 nodes
Current Status
16 nodes equipped with the APEnet+ board

QUonG Hybrid Computing Node
double Intel Xeon E5620
48 GB System Memory
2x S2075 NVIDIA Fermi GPUs
1 APEnet+ board
40 Gb/s InfiniBand HCA

Software Environment
CentOS 6.4
NVIDIA CUDA 6.5 driver
OpenMPI and MVAPICH2 MPI
LO|FA|MO DFM implementation

DNP Fault Manager
component in the APEnet+ FPGA design (uses very little of the FPGA resources)
interacts with APEnet+ on-board sensors
configurable by the HFM (timers, masks)
implements the Mutual Watchdog (Watchdog registers are in the FPGA and memory-mapped on PCIe)
able to send diagnostic messages to first neighbours; the messages are hidden in the physical protocol (Link Fault Manager)
(block diagram: APEnet+ FPGA design. The DNP comprises a TX/RX block, a 32-bit microcontroller, a collective communication block, a memory controller and a GPU I/O accelerator on the Network Interface; a Router with an NxN ports switch, routing logic, arbiter and link controllers; and an Off-board Interface where each torus link (e.g. X+) hosts a DNP Fault Manager, LiFaMa TX/RX FIFOs for the LiFaMa message flows alongside the DNP data flow, transmission control logic and the physical layer. A PCIe x8 Gen2 core, on-board memory and a 1GbE port complete the design.)
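As a rough illustration of how a status word like the DWR might be packed, the sketch below invents a bit layout (per-link fault bits plus temperature/voltage alarms for the events the talk lists). The real APEnet+ register map is not given in the slides; every field name and bit position here is an assumption.

```python
# Hypothetical DWR layout, for illustration only: six link-fault bits
# (one per torus link X+/X-/Y+/Y-/Z+/Z-) and two sensor-alarm bits.
LINK_FAULT_SHIFT, LINK_FAULT_BITS = 0, 6
TEMP_ALARM_BIT = 6
VOLT_ALARM_BIT = 7

def encode_dwr(link_faults, temp_alarm, volt_alarm):
    """Pack DNP status into one 32-bit word, as the DFM would before
    exposing it through the memory-mapped register."""
    word = 0
    for i, faulty in enumerate(link_faults):   # six torus links
        if faulty:
            word |= 1 << (LINK_FAULT_SHIFT + i)
    if temp_alarm:
        word |= 1 << TEMP_ALARM_BIT
    if volt_alarm:
        word |= 1 << VOLT_ALARM_BIT
    return word

def decode_dwr(word):
    """Unpack the same fields, as the HFM would after reading the DWR."""
    links = [bool(word & (1 << i)) for i in range(LINK_FAULT_BITS)]
    return {
        "link_faults": links,
        "temp_alarm": bool(word & (1 << TEMP_ALARM_BIT)),
        "volt_alarm": bool(word & (1 << VOLT_ALARM_BIT)),
    }

# e.g. X- link faulty plus a temperature alarm
w = encode_dwr([False, True, False, False, False, False],
               temp_alarm=True, volt_alarm=False)
```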
LO|FA|MO HFM implementation
Host Fault Manager
implemented as a GNU/Linux multi-threaded daemon running on QUonG nodes
configurable (nodes, supervisors, watchdog periods, ...) via GUI
implements the Mutual Watchdog (access to the watchdog register on the APEnet+ device)
queries host sensors and diagnostic tools
sends diagnostic messages via the Service Network (Ethernet)
special thread on Supervisor nodes to gather diagnostic messages
(diagram: HFM thread structure. Each node runs DNP WD, Host WD, SNET Monitor and Fault Notifier threads; the Supervisor node additionally runs an SNET Fault Logger collecting notifications from nodes A, B and C over the Service Network.)
Results on QUonG
Time to fault awareness measured on the Supervisor node:
T2 − T1 = 0.9 s at Tread = 500 ms, Twrite = 100 ms
T1 = time of fault injection
T2 = time of Supervisor awareness
Low impact on system performance → greatscaling
HFM resource usage
Virtual Mem: 120 MB | Shared Mem: 776 B | Resident Mem: 1 KB | CPU usage: < 0.1%
PCIe bus occupancy1 APEnet+ register access takes 6 µs
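As a back-of-the-envelope consistency check (this is not the formula used in the talk): if the reader flags a fault when the watchdog register has not advanced between two consecutive reads, the worst-case local detection latency is about two read periods, so the measured 0.9 s at Tread = 500 ms sits within the resulting 1 s bound plus message propagation.

```python
def worst_case_detection_ms(t_read_ms):
    """Upper bound on local stopping-fault detection latency for a
    'value unchanged between two consecutive reads' check: the fault
    can occur just after a read that still sees fresh data, so up to
    two full read periods pass before the flagging read. A toy bound,
    not the model from the paper."""
    return 2 * t_read_ms

# measured 0.9 s to Supervisor awareness at T_read = 500 ms:
# within the 1.0 s local bound plus propagation
bound = worst_case_detection_ms(500)
```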
(plots: "LO|FA|MO host fault, time to awareness": Fault Awareness Time (s) vs Watchdog Read Period (ms), 0–600 ms; "LO|FA|MO host fault, time to awareness scaling": Time to Supervisor Awareness (TR = 500 ms, TW = 100 ms) vs number of QUonG nodes, 2–32)
Conclusions and future work
We designed the Local Fault Monitor (LO|FA|MO) as a scalable approach to systemic fault awareness for distributed systems.
The LO|FA|MO implementation on the QUonG cluster relies on the APEnet+ interconnect card and is lightweight and scalable:
low resource occupancy (HW and SW)
diagnostic messages on the Service Network or embedded in the 3D network physical protocol
time to global fault awareness depends on and is dominated by the watchdog period

New developments:
pairing of LO|FA|MO with a fault reaction system (automatic task migration)
introduction of GPU monitoring
. . .
Thank you! Questions or comments?