Hardware-Integrated Approaches to Failure Advanced Warning
Ralph H. Castain, Ph.D.
Los Alamos National Laboratory
Outline
• Little history and perspective
  - What do we mean by “resilient”?
  - Traditional vs embedded approach
  - DARPA “built-in-test” program
• Cisco resilient router project
  - Brief overview of project
  - Our approach and partnership with OpenMPI
• Open Cluster Manager (OpenCM)
Motivation
• Head of new business unit for integrated diagnostics and control
• World’s largest customer
  - If system fails, will search out root cause
  - If your system, you pay cost of lost batch!
  - Rough cost/failure: $10M
  - Rough value of system: $200k
Resiliency
• Fault
  - Events that hinder the correct operation of a process
  - May not actually be a “failure” of a component, but can cause system-level failure or performance degradation below specified level
  - Effect may be immediate or some time in the future
  - Usually are rare; may not have many data examples
• Fault prediction
  - Estimate probability of incipient fault within some time period in the future
• Fault tolerance (reactive, static)
  - Ability to recover from a fault
• Robustness (metric)
  - How much can the system absorb without catastrophic consequences
• Resilience (proactive, dynamic)
  - Dynamically configure system to minimize impact of potential faults
Traditional Approach to Faults: The “Bathtub”
[Figure: “bathtub” failure curve. Labels: Infant Mortality, “Floor” Region, MTBF, Defined Lifetime? (candidate points A and B)]
What’s Wrong With That?
• Infant mortality
  - Resolved by extensive burn-in: costly
• Where to define “lifetime”?
  - A: Units decommissioned with considerable unused life
  - B: High probability of failures in advance
  - MTBF: ~50% of units fail before
• Bathtub floor does not sit at “zero”
  - Still significant probability of failure
• Can’t reliably estimate system lifetime due to multi-component degradation
  - Component-component interactions not reflected in individual component lifetime statistics
• Failures can be costly
  - Operational impact
  - Replacement costs
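The bathtub shape can be illustrated with a minimal sketch that sums three hazard terms: a decreasing Weibull hazard for infant mortality, a constant term for the floor, and an increasing Weibull hazard for wear-out. All parameter values and function names below are illustrative assumptions, not data from the talk. The point it shows is that the floor term keeps the failure rate above zero throughout the useful life.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/s) * (t/s)**(k-1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t, floor=0.0002):
    """Illustrative bathtub hazard (failures per hour); all parameters are assumed."""
    infant = weibull_hazard(t, shape=0.5, scale=200.0)      # shape < 1: decreasing
    wearout = weibull_hazard(t, shape=5.0, scale=10_000.0)  # shape > 1: increasing
    return infant + floor + wearout


# Print the hazard at a few points along the assumed lifetime.
for hours in (1, 100, 3_000, 9_000, 12_000):
    print(f"t={hours:>6} h  hazard={bathtub_hazard(hours):.5f} /h")
```

With these assumed parameters the hazard is highest at start-up, drops toward the floor in mid-life, and climbs again late in life, but it never falls below the floor value.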
DARPA BIT Program
• Multi-year program in 1990s
  - Focus on electronic, mechanical failures
  - Create a “resilient war fighting” capability
  - Enable better maintenance support of increasingly complex systems
• Objectives
  - Push-button “good box/bad box” readout
    • Eliminate diagnostic “carts”, “toolboxes”,…
  - Pre-emptive switch from failing systems
  - “Okay for mission” test
    • Reduce probability of failures during mission
Results Encouraging
• Vibration signatures
  - Impending bearing failures
    • Fans, axles, transmissions
• Thermal patterns
  - Mechanical failures
    • Existence of hot spots
    • Patterns revealed root causes, better prediction
  - Electronic failures
    • Patterns across boards, surface of chips
• Electrical frequency composition
  - Breakdowns in power transistors, other devices
  - IC internal wire connection degradation
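As a rough illustration of how a vibration or electrical signature might be checked for energy at a known fault frequency, the sketch below uses the Goertzel algorithm to measure signal power in a single frequency bin. The sample rate, the 120 Hz "fault tone", and the synthetic signals are assumptions made for illustration; the talk does not say which spectral method the DARPA work used.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Return |X[k]|**2 for the DFT bin nearest target_hz (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


# Synthetic one-second signals (assumed values): a 50 Hz baseline, with and
# without a weak 120 Hz component standing in for a fault signature.
RATE = 1000
healthy = [math.sin(2 * math.pi * 50 * i / RATE) for i in range(RATE)]
faulty = [h + 0.3 * math.sin(2 * math.pi * 120 * i / RATE)
          for i, h in enumerate(healthy)]

print("healthy power at 120 Hz:", goertzel_power(healthy, RATE, 120))
print("faulty power at 120 Hz: ", goertzel_power(faulty, RATE, 120))
```

Goertzel is attractive for embedded use because it needs only a few multiply-adds per sample and no buffer of the full spectrum, which fits the "minimal overhead" constraint, though a full FFT would be the usual choice when many frequencies must be watched at once.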
General Conclusions
• Exploit access to internals
  - Investigate optimal location, number of sensors
  - Embed intelligence, communications capability
• Integrate data from all available sources
  - Engineering design tests
  - Reliability life tests
  - Production qualification tests
• Utilize learning algorithms to improve performance
  - Both embedded, post process
  - Seed with expert knowledge
Objective
[Figure: plot of Prob of Failure (axis from 0 to 0.8) against Time Interval (1 to 10)]
Questions
• Can we develop technologies that would…
  - Warn of impending failure
    • Provide time to reconfigure, respond
    • Allow switch to backup systems for continuous operation
    • Provide an opportunity to pace ourselves
  - “Stretch” life of system
  - With minimal overhead
    • Cannot significantly impact performance
• How would we use them?
Direct Detection
[Figure: block diagram. Labels: PZT, Temp, and Voltage/Current sensors; Spectral Filter; ADC; FDDP Analyzer; outputs Good Box/Bad Box, Problem Diagnosis, Fault Prediction]
Integrate All Factors
Results (generalized)
• Prediction
  - Better than 97% of faults predicted within specified response time (hours)
  - Less than 5% “bad” prediction rate
• Diagnosis
  - Better than 80% correct localization
• Detection (good/bad box)
  - Better than 99% correct identification
  - Less than 5% false positive rate
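Rates like these are typically derived from counts of correct and incorrect calls. The sketch below shows one common way to compute a detection rate and a false positive rate. The counts are made-up numbers chosen only to illustrate the arithmetic, and the talk does not define exactly how its own rates were computed.

```python
def detection_rates(true_pos, false_neg, false_pos, true_neg):
    """Return (detection rate, false positive rate) from confusion counts."""
    detection_rate = true_pos / (true_pos + false_neg)        # bad boxes caught
    false_positive_rate = false_pos / (false_pos + true_neg)  # good boxes flagged
    return detection_rate, false_positive_rate


# Hypothetical test campaign: 200 truly bad boxes and 1000 truly good boxes.
rate, fpr = detection_rates(true_pos=198, false_neg=2, false_pos=40, true_neg=960)
print(f"detection rate: {rate:.1%}, false positive rate: {fpr:.1%}")
```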
© 2006 Cisco Systems, Inc. All rights reserved.
Problem Statements
1) Internet traffic and interconnect requirements are growing faster than the power available from silicon and software.
2) One approach is to build a larger, more distributed system.
3) The result is increased requirements on system software in terms of:
   a) High availability across a multi-component system
   b) Coherent view of intra-component messaging
   c) Fast convergence amongst components during change
   d) Distributed failover and effective sharing of load
   e) SW/HW maintenance w/o service impact
Problem Drivers
Technology is falling behind Demand Curve
[Figure: plot with a logarithmic axis from 1 to 10000. Labels: System BW, MHz-gate/mW, Mbps/W, System Power, “Shortfall!”]
Shortfall is overcome by architectural innovation and trading off: performance, functionality, programmability, physical size/density
Very hard to sustain long-term
Hardware Details
Product example
• Largest routing system available today
  - Each linecard chassis: 1.28 Tbps, 13.6 kW
  - Switch fabric chassis: 8 kW
Hardware Details
Product example
• Maximum HW configuration: 92 Tbps switching capacity across millions of interfaces
  - 48 x LC chassis + 8 x fabric chassis
  - => System messaging across all control CPUs to manage switch fabric and interface control
Software Details
System Software Requirements
1) Turn on once with remote access thereafter
2) Non-Stop == max 20 events/day lasting < 200ms each
3) Hitless SW Upgrades and Downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field Patchable
6) Beta Test New Features in situ
7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance and software installations
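The "Non-Stop" requirement (at most 20 events per day, each under 200 ms) implies a daily outage budget that is easy to work out. The short calculation below assumes the worst case the stated bound allows, with every event lasting the full 200 ms; the variable names are illustrative.

```python
MAX_EVENTS_PER_DAY = 20
MAX_EVENT_SECONDS = 0.200
SECONDS_PER_DAY = 24 * 60 * 60

# Worst case permitted by the requirement: every event uses its full 200 ms.
worst_case_outage = MAX_EVENTS_PER_DAY * MAX_EVENT_SECONDS  # seconds per day
availability = 1.0 - worst_case_outage / SECONDS_PER_DAY

print(f"Worst-case outage: {worst_case_outage:.1f} s/day")
print(f"Implied availability: {availability:.5%}")
```

Under that assumption the budget comes to roughly 4 seconds of disruption per day, or about 99.995% availability.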
Our Approach: Use OpenRTE
• Set up new frameworks
  - Sensor: monitor hardware, software
  - FDDP: use sensor inputs to compute sliding window of probabilities
• Contribute back to OpenMPI
  - Proprietary modules as binary plug-ins
• Write new cluster manager
  - Exploit new capabilities
  - Create as non-centralized application
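One way to picture the FDDP idea is a sliding window over recent sensor readings, with a trend fitted to the window and extrapolated to an alarm threshold. The sketch below uses an ordinary least-squares line as the trend model, which is a simplifying assumption. The class name, window size, and threshold are hypothetical, and this is not the ORTE implementation.

```python
from collections import deque


class TrendPredictor:
    """Hypothetical sliding-window trend predictor for a single sensor."""

    def __init__(self, window=8, threshold=85.0):
        self.readings = deque(maxlen=window)  # most recent (time, value) pairs
        self.threshold = threshold

    def add(self, t, value):
        self.readings.append((t, value))

    def time_to_threshold(self):
        """Estimate time until the fitted line crosses the threshold, or None."""
        n = len(self.readings)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in self.readings) / n
        mean_v = sum(v for _, v in self.readings) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in self.readings)
        if var_t == 0:
            return None
        slope = sum((t - mean_t) * (v - mean_v) for t, v in self.readings) / var_t
        if slope <= 0:
            return None  # flat or falling trend: no crossing predicted
        intercept = mean_v - slope * mean_t
        crossing_t = (self.threshold - intercept) / slope
        return max(0.0, crossing_t - self.readings[-1][0])


# Assumed example: temperature rising 2 degrees per time step from 60.
predictor = TrendPredictor(window=8, threshold=85.0)
for step in range(8):
    predictor.add(step, 60.0 + 2.0 * step)
print("estimated steps until threshold:", predictor.time_to_threshold())
```

The bounded deque keeps memory use fixed, in keeping with the low-overhead goal; a curve fit could replace the straight line without changing the surrounding structure.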
ORTE Extensions
• Software sensors
  - Memory footprint, CPU utilization (upper and lower), output file size
• Hardware sensors
  - Temperature, vibration
• FDDP
  - B-spline trend fit
• Resilient mapper
  - Fault groups
    • Nodes with common failure mode
    • Node can belong to multiple fault groups
  - Map replicas across fault groups
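The fault-group idea can be sketched as a greedy placement: each replica goes on the node that shares the fewest fault groups with the nodes already chosen. The node names, group names, and the greedy rule itself are illustrative assumptions; the talk does not describe the actual mapper algorithm.

```python
def map_replicas(node_groups, num_replicas):
    """Greedily choose nodes so replicas share as few fault groups as possible.

    node_groups: dict mapping node name -> set of fault-group names.
    Returns a list of chosen node names, at most one replica per node.
    """
    chosen = []
    used_groups = set()
    candidates = sorted(node_groups)  # sorted for deterministic tie-breaking
    for _ in range(min(num_replicas, len(candidates))):
        best = min(
            (n for n in candidates if n not in chosen),
            key=lambda n: len(node_groups[n] & used_groups),
        )
        chosen.append(best)
        used_groups |= node_groups[best]
    return chosen


# Hypothetical cluster: each node belongs to a rack group and a power group.
nodes = {
    "node0": {"rack-A", "psu-1"},
    "node1": {"rack-A", "psu-2"},
    "node2": {"rack-B", "psu-1"},
    "node3": {"rack-B", "psu-2"},
}
print(map_replicas(nodes, 2))
```

In this example the two replicas land on nodes with no rack or power supply in common, so no single fault group takes out both.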
Cluster Manager
• Orted auto-starts upon node power-up
  - Auto-detect and connect to CM
• CM launches specified number of replicas of each application
  - Resilient mapper => minimize single-point failures
• Applications auto-wireup
  - Plug-and-play inspired approach
  - Application decides which input to declare “leader”
Application Failure
• Orted detects (or predicts) failure and notifies CM
• CM utilizes resilient mapper to determine location of replacement
  - Future extension: probability of failure modes to help drive fault group selection
  - New replica is launched, does auto-wireup
• Connected applications
  - Loss of communication from “leader”
  - Independently select new “leader”
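For connected applications to pick a new “leader” independently and still agree, each needs a deterministic rule applied to the same view of which replicas are alive. The sketch below assumes a simple rule: choose the lowest replica ID among those heard from within a timeout. The rule, the names, and the timeout value are hypothetical, as the talk does not specify the selection method.

```python
def select_leader(last_heard, now, timeout=1.0):
    """Return the lowest-ID replica heard from within `timeout` seconds.

    last_heard: dict mapping replica ID -> time its last message was received.
    Returns None if no replica is currently considered alive.
    """
    alive = [rid for rid, t in last_heard.items() if now - t <= timeout]
    return min(alive) if alive else None


# Assumed scenario: replica 0 (the old leader) went silent at t=10.0.
last_heard = {0: 10.0, 1: 12.4, 2: 12.5}
print("new leader:", select_leader(last_heard, now=12.6))
```

Because the rule depends only on locally observed message times, no coordinator is needed, which matches the non-centralized design goal. It does assume the applications see broadly the same set of live replicas.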
OpenCM
• Transition Cisco work to open source
• Broaden mission
  - Extend to HPC, other embedded operations
  - Manage any collection of nodes
  - Resilient operation with hooks
    • MPI
    • Other application layers
• Released under the OpenMPI license
  - BSD-like, open use
http://www.open-mpi.org/
Concluding Remarks