characterization and mitigation of failures in complex systems center for advanced computing and...
TRANSCRIPT
Characterization and Mitigation of Failures in Complex Systems
Center for Advanced
Computing and Communications
Dept. of Electrical & Computer Eng
Duke University
Durham, NC 27708-0294
Dr. K. S. Trivedi Dr. Y. Hong
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
System Dependability
Dependability
Impairments
Means
Attributes
Faults
Errors
Failures
Procurement
Validation
Availability
Reliability
Safety
Security
Fault prevention (FP)
Fault tolerance (FT)
Fault removal (FR)
Fault forecasting (FF)
Survivability+ Service
Performability+ Performance
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
System Dependability
FP
Non-probabilistic modeling
Probabilistic modeling
FTRedundancy: hardware,software,information,time
Diversity: data, design, environment
FF
FRVerification (testing)
Maintenance: corrective (repair) and preventive
Minimal maintenance
Applications
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Modeling and Analysis
Probabilistic Modeling
Methods & Tools
Fault removal
Fault preventionFault tolerance
Fault forecasting
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
A Research Triangle
ToolsTools ApplicationsApplications
TheoryTheory Modeling methods and
Numerical solutions:
CTMC, SRN, MRSPN
FSPN, NHCTMC, FTREE
SHARPE
SPNP
SREPT
SRA
Hardware and software:
Fault-tolerant Systems
Real-time Systems
Computer Networks
Wireless Networks
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Modeling Methods
Probabilistic modeling
Fault trees
DTMC, CTMC
Non-probabilistic modelingNonlinear dynamics
FSPN
SPN
SDE
NHCTMC
MRSPN
SRN
HS
MRGPSMP
FBM, FGN
HS
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
CTMC SMP MRGP
T ~ Exp
CTMC
t
state
T ~ Gen
SMP
t
state
T1 ~ Gen
MRGP
t
state
T1 ~ Gen
T ~ Exp
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Relation: Dependability Models
GSPN SRNCTMC
RG
FTRE
MRM
RBD FT
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Modeling of HS and FSPN Statistical multiplexed ATM switch and HS model:
Bt
B
High priority Low priority
TRAFFIC
ATM SWITCH
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Modeling of HS and FSPN
FSPN model for ATM switch
K
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Solution Techniques Algorithms
Fast algorithms for network reliability and fault tree computation
Fast algorithms for SRN Numerical and simulation tools
SHARPE (symbolic hierarchical automated reliability and performance evaluator)
SREPT (software reliability estimation prediction tool)
SPNP (stochastic Petri net package) SRA (software rejuvenation agents)
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Architecture of SHARPE
ftree
relgraph
block
graph pfqn, mfqn
Hierarchical & Hybrid Compositions
SMP/MRGPMarkov
GSPN/SRN
Reliability/availability Performance analysisPerformability
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Architecture of SPNPStochastic Reward Net (SRN) models
Markovian Stochastic Petri Net
Non-MarkovianStochastic Petri Net
Fluid StochasticPetri Net (FSPN)
Analytic-Numeric Method Discrete Event Simulation (DES)
Steady-State
Transient Steady-State Transient
SOR
Gauss-Seidel
Power Method
Std. UniformizationFox-Glynn MethodStiff Uniformization
Batch Means
Regenerative Simulation
Indep. Replication
Restart
Importance SamplingImportance Splitting
FastMethods
Reachability Graph
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Architecture of SREPT
SREPT
Black-boxapproach
Architecture-basedapproach
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Black-box Approach
ENHPP modelENHPP model
ENHPP modelENHPP model
Test coverageTest coverage
Estimate of theEstimate of thenumber of faultsnumber of faults
ENHPP modelENHPP model
Sequence of Sequence of Interfailure timesInterfailure times
PredictedPredicted1. failure intensity1. failure intensity2. #faults remaining2. #faults remaining3. reliability after release3. reliability after release4. coverage4. coverage
PredictedPredicted1. failure intensity1. failure intensity2. #faults remaining2. #faults remaining3. reliability after release3. reliability after release
Software Software Product/Process Product/Process
Metrics Metrics + Past experience+ Past experience
Fault DensityFault Density
Regression TreeRegression Tree
Optimization & Optimization & ENHPP modelENHPP model
Sequence of Sequence of Interfailure times Interfailure times & Release criteria& Release criteria
Predicted release timePredicted release timeand Predicted coverageand Predicted coverage
PredictedPredicted1. failure intensity1. failure intensity2. #faults remaining2. #faults remaining3. reliability after release3. reliability after release
Failure intensityFailure intensityDebugging policiesDebugging policies
NHCTMCNHCTMC
Discrete-event Discrete-event simulation simulation
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
ArchitectureArchitecture
Failure behavior Failure behavior of componentsof components
Analytical Analytical modelingmodeling
Discrete-event Discrete-event simulationsimulation
ReliabilityReliabilityandand
PerformancePerformancepredictionprediction
Architecture-based Approach
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Challenges
LargenessFault trees using BDD
Markov model
Non-exponential distributions
DES (Discrete-Event Simulation)
Joint continuous & discrete state spaces
Combined performance and reliability/availability MRM
FSPN
NHCTMC
SDE
Automated generation/solution based SPN, SRN
Hierarchical methods, fixed point iteration
SMP, MRGP
SRN
HS
Stiffness
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Applications For high-performance and high-dependability Software
Software reliability (FR,FF) Software rejuvenation (FF,FR) Software fault-tolerance (FT,FF)
Hardware (with software) Preventive maintenance (FR,FF) Availability model (FF) Upgrade design (FT,FF) Performability of wireless communications (FF) Transient behavior of ATM networks with failure
(FF) Error recovery in communication networks (FF)
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Software Reliability
Early prediction of quality based on software product/process metrics.
Reliability growth modeling based on failure and coverage data collected during testing
Architecture-based software reliability Developed a tool (SREPT)
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Software Rejuvenation
Definition: proactive fault management technique for the systems to counteract the effect of aging
Approaches: Periodic: Rejuvenation at regular (deterministic) time
intervals irrespective of system load Periodic and instantaneous load: Rejuvenation at
regular intervals and wait until system load is zero Prediction-based: Rejuvenation at times based on
estimated times to failure (implemented in IBM Netfinity Director)
Purely time-based estimation Time and workload-based estimation
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Analytical ModelingUndergoing Undergoing rejuvenationrejuvenationRecoveringRecovering AvailableAvailable
BB CCAA
Subordinated non-homogeneous CTMCSubordinated non-homogeneous CTMC for for t = t =
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Numerical Example
Service rate and failure rate are functions of time, (t) and (t)
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Measurement-based Estimation
Objectivedetection and validation of software aging
Basic ideaperiodically monitor and collect data on the attributes responsible for determining the health of the executing software
Quantifying the effect of aging proposed metric - Estimated time to
exhaustion
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Non-Non-parametricparametric Regression Smoothing of Regression Smoothing of Rossby data - Real Memory FreeRossby data - Real Memory Free
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Software Fault Tolerance Recovery block: common-mode software failures, common-mode acceptance test failures, failures clustered in the input stream A generalized model for recovery block
strategy to capture the dependencies: operate input within a recovery block
execution evaluate module output in the recovery block difficult inputs clustered in input space
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
SRN Model for Recovery Block
A recovery blockwith clustered failures
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Hardware Maintenance Conditioned based maintenance: CTMC: closed form solution SHARPE: numerical solution Find optimal inspection interval
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Hardware Maintenance
Comparison:Closed form result& Numerical result
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Hardware Maintenance Time based maintenance: SMP based solution (PM: preventive maintenance) Find optimal maintenance interval
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Availability Models
Availability models commonly used in practice assume that times to outage and recovery are exponentially distributed.
How accurate will the all-exponential models be for systems w/ limited information of outage/recovery?
Can we give availability bounds for such systems?
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Result of a General Model For a system with multiple types of outage-recovery
(non-exponentially distributed outrages), the underlying stochastic process is a semi-Markov process (SMP).
We give a closed-form formula of system availability.
Findings –
Only the mean value of time-to-recovery (E[TTR]) affects the availability of systems. The distribution does not matter.
However, the distribution of Time-to-outage (TTO) does affect availability.
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Upgrade with Redundancy
Active node
Standby node
Load-sharing clustering is common in networks (wired, wireless, optical) and server systems.
How to take advantage of the cluster structure and redundancy to upgrade hardware and software components?
How to quantify system performance for different upgrade schemes?
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Upgrade Schemes
Baseline model
Direct transfer:
Simple, but long downtime, for non-realtime-critical system.
Phased upgrade:
Almost zero downtime, upgrade paradox, version incompatibility, etc.
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Performability of Cellular Control Channel Protection
Each cell has Nb base repeaters (BR)Each BR provides M TDM channelsOne control channel resides in one of the BRs
Control channel down
System down(!)
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Automatic Protection Switch
Upon control_down, the failed control channel is automatically switched to a channel on a working base repeater.
control_down causes only system partially downno longer a full outage!
Asys Pblock PdropAPS
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Model of System w/o APS
CTMC State (b,k)b=No. of BR upk=No. of talking
channels
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Model of System w/ APS
A Segment of the Composite Markov Chain Model
CTMC State (b,k)b=No. of BR upk=No. of talking
channels
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Numerical Results
Handoff Call Blocking
Probability
Improvement by APS
Unavailability in handoff call
dropping probability
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
ATM Networks under Overloads
To quantify the effects of transients in ATM networks with • Markov Modulated Poisson (MMPP) used to allow for Correlated arrivals• Transient effects: relaxation time, maximum overshoot,and Expected excess number of losses in overload
ON0FF
1
0
10Low
High
2-state MMPP traffic model
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Queuing Model for ATM
A failure occurs in the connection (N1 and N3).Therefore, the traffic destined for N3 is re-routed
through N2
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
SRN Model for ATM
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Numerical Results
Loss probability vs. burstiness Queue length vs. burstiness
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Error Recovery in Communication
1:N protection switching:
Study error recovery in communication networks using (non-exponential detection/restoration times) MRGP
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Two Modeling Methods
CTMC is used with ignoring the detection and restoration times
MRGP (a non-Markovianmodeling) is used with including the detection and restoration times
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Numerical Result
Comparison of CTMC and non-Markovian model
Center for Advanced Computing and CommunicationDepartment of Electrical and Computer Engineering, Duke
University
Open Problems Theoretical studies in the modeling and
analysis of complex systems and failures Capabilities, limitations, and relationships among
formalisms such as FSPN and HS, FSPN and SDE. Fast algorithms
Fast algorithms for FSPN Fast algorithms for discrete-event simulation Fast algorithms for non-Markovian models
Applications Software: further exploration of software rejuvenation Performability analysis of (wired and wireless)
networks