
Hardware-Integrated Approaches to Failure Advanced Warning

Ralph H. Castain, Ph.D.
Los Alamos National Laboratory

Outline

• A little history and perspective
    What do we mean by “resilient”?
    Traditional vs. embedded approach
    DARPA “built-in-test” program

• Cisco resilient router project
    Brief overview of project
    Our approach and partnership with OpenMPI

• Open Cluster Manager (OpenCM)

Motivation

• Head of new business unit for integrated diagnostics and control

• World’s largest customer
    If a system fails, they will search out the root cause
    If it is your system, you pay the cost of the lost batch!
    Rough cost per failure: $10M
    Rough value of system: $200k

Resiliency

• Fault: events that hinder the correct operation of a process
    May not actually be a “failure” of a component, but can cause system-level failure or performance degradation below the specified level
    Effect may be immediate or some time in the future
    Usually rare
    May not have many data examples

• Fault prediction: estimate the probability of an incipient fault within some time period in the future

• Fault tolerance (reactive, static): ability to recover from a fault

• Robustness (metric): how much the system can absorb without catastrophic consequences

• Resilience (proactive, dynamic): dynamically configure the system to minimize the impact of potential faults

Traditional Approach to Faults: The “Bathtub”

[Figure: bathtub curve with the labels “Infant Mortality”, “Floor” Region, MTBF, “Defined Lifetime?”, and point B]

What’s Wrong With That?

• Infant mortality
    Resolved by extensive burn-in: costly

• Where to define “lifetime”?
    A: Units decommissioned with considerable unused life
    B: High probability of failures in advance
    MTBF: ~50% of units fail before then

• Bathtub floor does not sit at “zero”
    Still significant probability of failure

• Can’t reliably estimate system lifetime due to multi-component degradation
    Component-component interactions not reflected in individual component lifetime statistics

• Failures can be costly
    Operational impact
    Replacement costs
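The “~50% of units fail before MTBF” point depends on the shape of the lifetime distribution. As a rough check, the sketch below assumes two textbook models that are not taken from the slides: a symmetric normal wear-out distribution and a constant-hazard exponential distribution. The function names are illustrative only.

```python
import math


def fraction_failed_by_mean_normal() -> float:
    """CDF of a normal lifetime distribution evaluated at its own mean."""
    # Phi(0) = 0.5 * (1 + erf(0)); the result does not depend on mean or spread.
    return 0.5 * (1.0 + math.erf(0.0))


def fraction_failed_by_mean_exponential() -> float:
    """CDF of an exponential lifetime distribution evaluated at its mean."""
    # F(t) = 1 - exp(-rate * t); at t = 1 / rate this is 1 - exp(-1).
    return 1.0 - math.exp(-1.0)


print(fraction_failed_by_mean_normal())       # symmetric wear-out assumption
print(fraction_failed_by_mean_exponential())  # constant-hazard assumption
```

Under the symmetric assumption exactly half the units fail before the mean lifetime; under the constant-hazard assumption the share is closer to 63%. Either way, a lifetime defined at MTBF leaves a large share of units failing in service.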

DARPA BIT Program

• Multi-year program in the 1990s
    Focus on electronic, mechanical failures
    Create a “resilient war fighting” capability
    Enable better maintenance support of increasingly complex systems

• Objectives
    Push-button “good box/bad box” readout
        • Eliminate diagnostic “carts”, “toolboxes”, …
    Pre-emptive switch from failing systems
    “Okay for mission” test
        • Reduce probability of failures during mission

Results Encouraging

• Vibration signatures
    Impending bearing failures
        • Fans, axles, transmissions

• Thermal patterns
    Mechanical failures
        • Existence of hot spots
        • Patterns revealed root causes, better prediction
    Electronic failures
        • Patterns across boards, surface of chips

• Electrical frequency composition
    Breakdowns in power transistors, other devices
    IC internal wire connection degradation

General Conclusions

• Exploit access to internals
    Investigate optimal location, number of sensors
    Embed intelligence, communications capability

• Integrate data from all available sources
    Engineering design tests
    Reliability life tests
    Production qualification tests

• Utilize learning algorithms to improve performance
    Both embedded and post-process
    Seed with expert knowledge

Objective

[Figure: probability of failure (axis from 0 to 0.8) plotted against time interval (1 to 10)]


Questions

• Can we develop technologies that would…
    Warn of impending failure
        • Provide time to reconfigure, respond
        • Allow switch to backup systems for continuous operation
        • Provide an opportunity to pace ourselves
    “Stretch” life of system
    With minimal overhead
        • Cannot significantly impact performance

• How would we use them?

Integrate All Factors

[Figure: block diagram with the labels Direct Detection, Spectral Filter, PZT, Temp, Voltage, Current, ADC, FDDP Analyzer, Good Box/Bad Box, Problem Diagnosis, and Fault Prediction]

Results (generalized)

• Prediction
    Better than 97% of faults predicted within specified response time (hours)
    Less than 5% “bad” prediction rate

• Diagnosis
    Better than 80% correct localization

• Detection (good/bad box)
    Better than 99% correct identification
    Less than 5% false positive rate





© 2006 Cisco Systems, Inc. All rights reserved.

Problem Statements

1) Internet traffic and interconnect requirements are growing faster than the power of available silicon and software.

2) One approach is to build a larger, more distributed system.

3) The result is increased requirements on system software in terms of:

a) High availability across a multi-component system

b) Coherent view of intra-component messaging

c) Fast convergence amongst components during change

d) Distributed failover and effective sharing of load

e) SW/HW maintenance without service impact


Problem Drivers

[Figure: log-scale chart (1 to 10000) with curves labeled System BW, MHz-gate/mW, Mbps/W, and System Power, annotated “Shortfall!”]

Shortfall is overcome by architectural innovation and trading off: performance, functionality, programmability, physical size/density

Very hard to sustain long-term

Technology is falling behind the demand curve


Product example

• Largest Routing System available today

Each Linecard Chassis: 1.28 Tbps, 13.6 kW

Switch Fabric Chassis: 8 kW

Hardware Details


Product example

• Maximum HW configuration: 92 Tbps switching capacity across millions of interfaces

48 x LC chassis + 8 x Fabric chassis

=> System messaging across all control CPUs to manage switch fabric and interface control

Hardware Details


System Software Requirements

1) Turn on once with remote access thereafter

2) Non-Stop == max 20 events/day lasting < 200ms each

3) Hitless SW Upgrades and Downgrades

4) Upgrade/downgrade SW components across delta versions

5) Field Patchable

6) Beta Test New Features in situ

7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…

8) Configuration

9) Clear APIs; minimize application awareness

10) Extensive remote capabilities for fault management, software maintenance and software installations

Software Details
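Requirement 2 above implies a daily disruption budget that can be worked out directly. The sketch below assumes, purely for illustration, that every event counts as a full outage lasting the maximum 200 ms; the function and constant names are not from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def worst_case_availability(max_events_per_day: int, max_event_seconds: float) -> float:
    """Fraction of the day the system is up if every allowed event is a full outage."""
    downtime_seconds = max_events_per_day * max_event_seconds
    return 1.0 - downtime_seconds / SECONDS_PER_DAY


# "Non-Stop == max 20 events/day lasting < 200ms each"
daily_budget_seconds = 20 * 0.200
availability = worst_case_availability(20, 0.200)

print(f"worst-case disruption per day: {daily_budget_seconds:.1f} s")
print(f"worst-case availability: {availability:.6%}")
```

Under that assumption the budget is at most 4 seconds of disruption per day, which corresponds to roughly 99.995% availability.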

Our Approach: Use OpenRTE

• Set up new frameworks
    Sensor: monitor hardware, software
    FDDP: use sensor inputs to compute sliding window of probabilities

• Contribute back to OpenMPI
    Proprietary modules as binary plug-ins

• Write new cluster manager
    Exploit new capabilities
    Create as non-centralized application
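To make the FDDP idea above more concrete, here is a minimal sketch of trend extrapolation over a sliding window of sensor readings. It is not the project's implementation: it swaps in a plain least-squares straight line where the slides name a B-spline trend fit, it estimates time until a threshold is crossed rather than a probability, and the function names, the window size, and the threshold are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples taken at t = 0, 1, 2, ..."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    t_mean = (n - 1) / 2.0
    y_mean = sum(samples) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(samples))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def intervals_until_threshold(
    samples: Sequence[float], window: int, threshold: float
) -> Optional[float]:
    """Extrapolate the trend of the last `window` samples to a (hypothetical) limit.

    Returns 0.0 if the fitted value is already at or above the threshold,
    None if the trend is flat or falling, else the number of sampling
    intervals until the fitted line reaches the threshold.
    """
    recent = list(samples)[-window:]
    slope, intercept = linear_trend(recent)
    fitted_now = intercept + slope * (len(recent) - 1)
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope


# Hypothetical temperature readings, one per sampling interval.
readings = [60.0, 61.0, 62.0, 63.0, 64.0, 65.0]
print(intervals_until_threshold(readings, window=4, threshold=70.0))
```

A sensor component could run a check like this on each new reading and raise a warning when the estimate drops below the time needed to reconfigure.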

ORTE Extensions

• Software sensors
    Memory footprint, CPU utilization (upper and lower), output file size

• Hardware sensors
    Temperature, vibration

• FDDP
    B-spline trend fit

• Resilient mapper
    Fault groups
        • Nodes with common failure mode
        • Node can belong to multiple fault groups
    Map replicas across fault groups
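The fault-group idea above can be sketched as a greedy placement: pick each replica's node so that it shares as few fault groups as possible with the nodes already chosen. This is an illustrative assumption about how such a mapper might work, not the ORTE resilient mapper itself; `map_replicas` and the example group names are hypothetical.

```python
from typing import Dict, List, Set


def map_replicas(fault_groups: Dict[str, Set[str]], num_replicas: int) -> List[str]:
    """Greedily pick nodes so that replicas share as few fault groups as possible."""
    # Invert group -> nodes into node -> groups; a node may sit in several groups.
    # Nodes that appear in no fault group are not considered by this sketch.
    node_groups: Dict[str, Set[str]] = {}
    for group, nodes in fault_groups.items():
        for node in nodes:
            node_groups.setdefault(node, set()).add(group)

    chosen: List[str] = []
    used_groups: Set[str] = set()
    candidates = sorted(node_groups)  # sorted so the result is deterministic
    while candidates and len(chosen) < num_replicas:
        # Prefer the node overlapping least with groups already in use,
        # then the node in the fewest groups, then the lowest name.
        best = min(
            candidates,
            key=lambda n: (len(node_groups[n] & used_groups), len(node_groups[n]), n),
        )
        chosen.append(best)
        used_groups |= node_groups[best]
        candidates.remove(best)
    return chosen


# Hypothetical fault groups: nodes sharing a rack or a power supply fail together.
groups = {
    "rack1": {"n1", "n2"},
    "rack2": {"n3", "n4"},
    "psuA": {"n1", "n3"},
    "psuB": {"n2", "n4"},
}
print(map_replicas(groups, 2))
```

In this example the two replicas land on nodes that share neither a rack nor a power supply, so no single fault group can take out both.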

Cluster Manager

• Orted auto-starts upon node power-up
    Auto-detect and connect to CM

• CM launches specified number of replicas of each application
    Resilient mapper => minimize single-point failures

• Applications auto-wireup
    Plug-and-play inspired approach
    Application decides which input to declare “leader”

Application Failure

• Orted detects (or predicts) failure and notifies CM

• CM utilizes resilient mapper to determine location of replacement
    Future extension: probability of failure modes to help drive fault group selection
    New replica is launched, does auto-wireup

• Connected applications
    Loss of communication from “leader”
    Independently select new “leader”
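One way connected applications could each pick a new “leader” without coordinating is to apply the same deterministic rule to the inputs they can still hear. The sketch below assumes a timeout on the last message received and a “lowest name wins” rule; both are illustrative choices, and `select_leader` is a hypothetical helper rather than part of ORTE.

```python
from typing import Dict, Optional


def select_leader(last_heard: Dict[str, float], now: float, timeout: float) -> Optional[str]:
    """Pick the leader among inputs heard from within `timeout` seconds.

    Every consumer that sees the same set of live inputs reaches the same
    answer, because the rule (lowest name wins) is deterministic.
    """
    live = [name for name, seen in last_heard.items() if now - seen <= timeout]
    return min(live) if live else None


# Hypothetical timestamps (seconds) of the last message from each replica.
last_heard = {"replica-0": 10.0, "replica-1": 10.2, "replica-2": 10.3}
print(select_leader(last_heard, now=10.5, timeout=1.0))

# replica-0 goes silent while the others keep sending.
last_heard.update({"replica-1": 14.5, "replica-2": 14.8})
print(select_leader(last_heard, now=15.0, timeout=1.0))
```

After the original leader stops sending, each consumer independently moves to the next live replica once the timeout expires.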





OpenCM

• Transition Cisco work to open source

• Broaden mission
    Extend to HPC, other embedded operations
    Manage any collection of nodes
    Resilient operation with hooks
        • MPI
        • Other application layers

• Released under the OpenMPI license
    BSD-like, open use

http://www.open-mpi.org/

Concluding Remarks
