1
Some thoughts for the industry session
Prof. Kishor S. Trivedi
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291
Phone: (919) 660-5269
e-mail: [email protected]
At present: Visiting Professor, IIT Kanpur, CSE Dept.
Cochin Conference, Dec 18, 2002
2
What does industry want?
Well trained students
Short term research problems solved
Short courses on timely topics
3
What do faculty want?
Funding for 'their' research
Place their students in good company labs
Hope to get their research results transferred to industry
To get to know important and difficult problems that can drive their research
4
Some lessons learned
Student placement should be guided by the advisor
Start early with summer internship
Patience is needed in listening to problems from industry
Patience is needed in getting the IP problems resolved
Expect to do at least 50% more work than the funding provided
Tech transfer is a double edged sword
Practical problems can give rise to respectable research papers
Short courses are ideal entry points
5
Characteristics of the Systems Being Studied
Dependability (Reliability, Availability, Safety):
Redundancy: Hardware (Static, Dynamic), Information, Time
Fault Types: Permanent, Intermittent, Transient, Design
Fault Detection, Automated Reconfiguration
Imperfect Coverage
Maintenance: Scheduled, Unscheduled
6
Characteristics of the Systems Being Studied
Performance: Resource Contention, Concurrency and Synchronization, Timeliness (Have to Meet Deadlines)
Composite Performance and Dependability: Degradable Levels of Performance
Need Techniques and Tools that can Evaluate Systems with All the Characteristics Above
Explicitly Address Complexity
7
MEASURES TO BE EVALUATED
Dependability
Reliability: R(t), System MTTF
Availability: Steady-state, Transient, Interval
Safety
“Does it work, and for how long?”
Performance
Throughput, Loss Probability, Response Time
“Given that it works, how well does it work?”
8
MEASURES TO BE EVALUATED
Composite Performance and Dependability
“How much work will be done (lost) in a given interval, including the effects of failure/repair/contention?”
Need Techniques and Tools That Can Evaluate Performance, Dependability and Their Combinations
9
PURPOSE OF EVALUATION
Understanding a System
Observation
Operational Environment
Controlled Environment
Reasoning
A Model is a Convenient Abstraction
10
PURPOSE OF EVALUATION
Predicting Behavior of a System
Need a Model
Accuracy Based on Degree of Extrapolation
All Models are Wrong; Some Models are Useful
Prediction is fine as long as it is not about the future
11
Methods of Quantitative Evaluation
Measurement-Based
Most believable, most expensive
Not always possible or cost effective during system design
12
Methods of Quantitative Evaluation (Continued)
Model-Based
Less believable, Less expensive
1. Discrete-Event Simulation vs. Analytic
2. State-Space Methods vs. Non-State-Space Methods
3. Hybrid: Simulation + Analytic (SPNP)
4. State Space + Non-State Space (SHARPE)
13
Why MODEL?
Provides a framework for gathering, organizing, understanding and evaluating information about a system, e.g. Zitel, US&S, HP
A cost-effective means to evaluate a system, e.g. Boeing, US&S, HP, IBM, Motorola, Cisco, SUN
14
Why MODEL? (continued)
Provides a means of evaluating a set of alternatives in a structured and quantitative manner, e.g. Zitel, DEC, HP
Sometimes needed due to legal and contractual obligations, e.g. FAA
Sometimes needed for business reasons: Motorola, SUN, Cisco
15
Compare two CLIENT-SERVER Architectures
[Figure: Architecture 1 and Architecture 2]
16
Compare Connection Reliabilities
Connection reliability R(t) is the probability that throughout the interval [0,t) at least one path exists from the client to server on which all components are operational.
From R(t), system mean time to failure can be computed:
$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt$$
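As an illustrative sketch only, the integral of R(t) can be checked numerically. The layout below (two independent client-server paths, each a series of exponentially failing components) and all failure rates are assumptions made for the example; they are not the two architectures compared in the talk.

```python
import math

def path_reliability(t, rates):
    """Series path: every component must survive to time t (exponential lifetimes)."""
    return math.exp(-sum(rates) * t)

def connection_reliability(t, paths):
    """Parallel paths: the connection is lost only if every path has failed."""
    prob_all_failed = 1.0
    for rates in paths:
        prob_all_failed *= 1.0 - path_reliability(t, rates)
    return 1.0 - prob_all_failed

def mttf(reliability, horizon, steps=200_000):
    """Trapezoidal approximation of the integral of R(t) over [0, horizon]."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h

# Hypothetical failure rates (per hour) for two identical, independent paths.
PATHS = [[1e-3, 2e-3], [1e-3, 2e-3]]
mttf_numeric = mttf(lambda t: connection_reliability(t, PATHS), horizon=20_000.0)

# For two identical parallel paths with total path rate a,
# R(t) = 2 exp(-a t) - exp(-2 a t), so MTTF = 2/a - 1/(2a) = 1.5/a.
mttf_exact = 1.5 / 3e-3
```

The horizon only needs to be long enough that R(t) is negligible beyond it; here R(20000) is on the order of exp(-60).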
17
Compare Connection Reliabilities
18
Compare Connection Availabilities
Connection (instantaneous, transient or point) availability A(t) is the probability that at time t at least one path exists from the client to server on which all components are operational.
A(t) ≥ R(t), and the limiting or steady-state availability is
$$A = \lim_{t \to \infty} A(t)$$
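A minimal sketch of these quantities for the simplest case, assuming a single repairable component with exponential failure rate `lam` and repair rate `mu` (a two-state Markov model chosen here for illustration, not one of the architectures in the talk):

```python
import math

def reliability(t, lam):
    """Probability of no failure in [0, t)."""
    return math.exp(-lam * t)

def availability(t, lam, mu):
    """Point availability A(t) of a two-state (up/down) Markov model, starting up."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows without bound."""
    return mu / (lam + mu)

# Hypothetical rates (per hour): MTTF of 1000 h, mean repair time of 10 h.
LAM, MU = 1e-3, 1e-1
```

With repair, A(t) never falls below R(t): being up at time t includes the case of never having failed, plus the cases of having failed and been repaired.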
19
Compare Connection Availabilities
20
MODELING THROUGHOUT SYSTEM LIFECYCLE
System Specification/Design Phase
Answer “What-if” Questions
Compare design alternatives (Zitel, HP, Motorola)
Performance-Dependability Trade-offs (DEC)
Design Optimization (wireless handoff)
21
MODELING THROUGHOUT SYSTEM LIFECYCLE
Design Verification Phase
Use Measurements + Models
E.g. Fault Injection + Reliability Model
Union Switch and Signals, Boeing, Draper
Configuration Selection Phase: DEC
System Operational Phase: Lucent
It is fun!
22
CASE STUDY: ZITEL
Comparison of two different fault-tolerant RAMdisks.
Stochastic Petri Net Package (SPNP) was used to model the two systems for their reliability.
23
CASE STUDY: ZITEL
Trivedi worked with the designers directly:
Model validation was done using face validation and sanity checks.
Parameterization was easy due to the experience of the designers.
One difficult research problem originated from the study; it was subsequently solved and published in the Microelectronics and Reliability journal.
24
CASE STUDY: VAXCLUSTER
Developed three models of Processor Subsystem:
Two-Level Decomposition (IEEE-TR, Apr 89)
Inner Level: 9-state Markov
Outer level: n parallel diodes
A Detailed SPN Model (PNPM 89)
A Detailed SPN model for Heterogeneous Cluster (Avresky book)
25
CASE STUDY: VAXCLUSTER
Storage Subsystem Model: A fixed-point iteration over a set of Markov submodels. (IEEE-TR, to appear)
Observed that availability is maximized with 2 processors (HCSS 90)
Many interesting reliability, availability, performability measures computed
26
Case Study: HP
Cluster Availability Modeling
Server Availability
Mass Storage Arrays Availability Modeling
Started with Markov chains via SHARPE
Progressed toward Stochastic Petri Nets and Stochastic Reward Nets via SPNP
27
CASE STUDY: LUCENT
A Validated Model of Hardware-Software Availability.
Worked with V. Mendiratta of Naperville.
Model is semi-Markov; solved using SHARPE.
Parameters collected from field data.
Model results validated against actual measurements.
28
CASE STUDY: LUCENT, IBM, Motorola, SUN
Software Rejuvenation: A technique to counter software “aging” and increase its availability to clients.
Evaluated optimum rejuvenation interval which maximizes steady state availability (minimizes expected cost) for IBM cluster, Motorola CMTS cluster.
Collected data from real systems to show aging and to determine proactive fault management strategies.
Worked in our lab, with SUN Microsystems.
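To make the idea of an optimum rejuvenation interval concrete, here is a minimal sketch under stated assumptions. It is a textbook-style renewal (age-replacement) model, not the models used in the case studies: time to failure is assumed Weibull with increasing hazard (aging), a crash costs a fixed repair time, a planned rejuvenation costs a shorter fixed downtime, and every parameter value is hypothetical.

```python
import math

def weibull_survival(t, shape, scale):
    """Probability that the software has not failed by age t."""
    return math.exp(-((t / scale) ** shape))

def availability(delta, shape, scale, repair_time, rejuvenation_time, steps=2000):
    """Long-run availability when rejuvenation is triggered at age `delta` hours.

    One renewal cycle ends either with a failure (then repair) or with a
    planned rejuvenation at age delta, whichever comes first.
    """
    h = delta / steps
    # Expected uptime per cycle = integral of the survival function over [0, delta].
    uptime = 0.5 * (1.0 + weibull_survival(delta, shape, scale))
    for i in range(1, steps):
        uptime += weibull_survival(i * h, shape, scale)
    uptime *= h
    p_fail = 1.0 - weibull_survival(delta, shape, scale)
    downtime = p_fail * repair_time + (1.0 - p_fail) * rejuvenation_time
    return uptime / (uptime + downtime)

# Hypothetical parameters: shape 2 (aging), scale 500 h, 8 h repair, 0.5 h rejuvenation.
ARGS = (2.0, 500.0, 8.0, 0.5)
candidates = [10.0 * k for k in range(1, 201)]  # 10 h to 2000 h
best_delta = max(candidates, key=lambda d: availability(d, *ARGS))
```

Rejuvenating too often wastes uptime on planned restarts, and rejuvenating too rarely lets expensive failures dominate, so the grid search finds an interior maximum for these parameters.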
29
CASE STUDY: MOTOROLA
Availability & Performability Modeling:
Modeled several configurations of Communication Enterprise Common Platform.
Practical approaches for approximating steady state measures in large, repairable, and highly dependable systems: model decomposition, state space truncation, etc.
Both SHARPE and SPNP used.
30
CASE STUDY: MOTOROLA
Recovery strategies in wireless handoff:
proposed and modeled several strategies
a patent being filed by Motorola
SPNP was used
Hierarchy of two-level models used
Fixed-point iteration was used
31
CASE STUDY: BELLCORE
Architecture-based software reliability:
proposed a methodology
applied the methodology to SHARPE
used Bellcore’s test coverage tool, ATAC, to parameterize
the model
Bellcore is currently enhancing ATAC to incorporate our
methodology
32
CASE STUDY: DRAPER LAB
Overall aim was verification of a system with very high reliability/availability specifications.
Prototype under consideration was FTPP cluster 3.
Hybrid approach proposed:
Fault injection based measurements.
Statistical analysis of measured data to enable parameterization of analytical models.
33
CASE STUDY: DRAPER LAB
Reliability modeling of the prototype done:
Parameterization done with the aid of existing reliability databases.
Analytical solution provided exact closed form expressions.
Markov model solved using SHARPE.
Petri net model solved using SPNP.
Reliability bottlenecks found.
34
CASE STUDY: AT & T
GSHARPE: A preprocessor to SHARPE developed at Bell Labs by a Duke student.
User can specify Weibull failure times and lognormal and other repair time distributions.
GSHARPE fits these to phase type distributions and produces a Markov model for processing by SHARPE.
35
CASE STUDY: BOEING
An Integrated Reliability Environment
A working prototype
Developed a high-level modeling language (SDM)
Designed and implemented an intelligent interpreter
36
CASE STUDY: BOEING (Continued)
Interpreter determines which solution method is applicable
Five different modeling engines are integrated: CAFTA, SETS, EHARP, SHARPE and SPNP.
37
QUANTITATIVE EVALUATION TAXONOMY
Closed-form solution
Numerical solution using a tool
38
MODELING TAXONOMY
39
STATE SPACE MODELING TAXONOMY
40
ANALYTIC MODELING TAXONOMY
NON-STATE SPACE MODELING TECHNIQUES
SP reliability block diagrams
Non-SP reliability block diagrams
Product form queuing models
41
State Space Modeling Taxonomy
State space methods
Markovian modeling
non-Markovian modeling
discrete-time Markov chains
continuous-time Markov chains
Markov reward models
Semi-Markov models
Markov regenerative models
Non-Homogeneous Markov
42
State-Space Based Models
Transition label:
Probability: (homogeneous) discrete-time Markov chain (DTMC)
Time-independent Rate: homogeneous continuous-time Markov chain
Time-dependent Rate: non-homogeneous continuous-time Markov chain
Distribution function: semi-Markov process
Two Dist. Functions: Markov regenerative process
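For the homogeneous continuous-time case, a small sketch may help show what "solving" such a model means. The example below assumes a hypothetical system with two processors, per-processor failure rate `LAM`, and a single repair facility with rate `MU`; it solves the balance equations for the steady-state probabilities and then attaches reward rates to states, as in a Markov reward model. The solver is plain Gaussian elimination, adequate only for small models.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination with partial pivoting."""
    n = len(Q)
    # Row i of A is the balance equation for state i (Q transposed);
    # the last equation is replaced by the normalization condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    A[-1] = [1.0] * n
    b[-1] = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][c] * pi[c] for c in range(i + 1, n))
        pi[i] = (b[i] - s) / A[i][i]
    return pi

# Hypothetical rates (per hour). States: 0 = two up, 1 = one up, 2 = none up.
LAM, MU = 1e-3, 1e-1
Q = [
    [-2 * LAM, 2 * LAM, 0.0],
    [MU, -(LAM + MU), LAM],
    [0.0, MU, -MU],
]
pi = steady_state(Q)

availability = pi[0] + pi[1]              # system is up if any processor is up
reward = [2.0, 1.0, 0.0]                  # reward rate = number of working processors
expected_capacity = sum(r * p for r, p in zip(reward, pi))
```

Changing only the reward vector turns the same chain into an availability model (rewards 1, 1, 0) or a capacity-oriented performability model (rewards 2, 1, 0).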
43
IN ORDER TO FULFILL OUR GOALS OF
Modeling Performance, Dependability and
Performability
Modeling Complex Systems
We Need
Automatic Generation and Solution of Large
Markov Reward Models
44
IN ORDER TO FULFILL OUR GOALS OF
Facility for State Truncation, Hierarchical composition of Non-State-Space and State-Space Models, Fixed-Point Iteration
There are Two Tools that Potentially meet these Goals
Stochastic Petri Net Package (SPNP)
Symbolic Hierarchical Automated Rel. and Perf. Evaluator (SHARPE)
45
MODELING SOFTWARE PACKAGES
HARP - Hybrid Automated Reliability Predictor (Duke Univ., funded by NASA Langley)
SAVE - System Availability Estimator (Duke Univ., funded by IBM)
SHARPE - Symbolic Hierarchical Automated Reliability and Performance Evaluator; installed at nearly 280 locations (GUI available)
SPNP - Stochastic Petri Net Package; installed at nearly 120 locations (iSPN - GUI available)
D_RAMP - for Union Switch and Signals, by Duke, UVA and CMU
SDM - Boeing Integrated Reliability Modeling Environment (jointly developed by Duke Univ., Univ. of Wash. and Boeing)
SDDS - Developed by Sohar with help from K. Trivedi
SREPT - Software Reliability Estimation and Prediction Tool
46
Challenges in Modeling
47
COMPLEXITIES OF MODELS
Large State Space
Model construction problem
Model solution problem
Model Stiffness.
Fast and slow rates acting together
Failure And Recovery/Repair
Performance and failure
48
COMPLEXITIES OF MODELS
Modeling Non-Exponential Distributions
Combining performance and reliability
Believability/Understandability/Usability
Incorporation in the design process
Connection between measurements & models:
Parameterization
Validation
49
LARGENESS TOLERANCE
Automated Model Construction
Stochastic Petri nets (GreatSPN, SPNP, SHARPE, DSPNexpress, ULTRASAN)
High level languages (SAVE, QNAP, ASSIST, SDM)
Fault-Tree + Recovery Info (HARP)
Object-Oriented Approaches (TANGRAM)
Loops in the specification of CTMC (SHARPE)
50
LARGENESS TOLERANCE
Efficient numerical solution techniques
Sparse Storage
Accurate and Efficient Solution Methods
We have Generated and Solved Models with 1,000,000 states (has gone up considerably recently)
Steady-State: Near-Optimal SOR
Transient: Modified Jensen's method
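Jensen's method is also known as uniformization (or randomization). The following is a minimal sketch of the basic, unmodified method on a dense matrix, assuming a small two-state availability model with hypothetical rates; the modifications and sparse storage referred to above are not shown.

```python
import math

def uniformization(Q, p0, t, tol=1e-12, max_terms=10_000):
    """Transient state probabilities p(t) = p0 * exp(Q t) of a CTMC.

    Basic uniformization: p(t) = sum_k Poisson(k; q t) * p0 * P^k,
    with P = I + Q / q and q at least the largest exit rate.
    Assumes q * t is small enough that exp(-q * t) does not underflow.
    """
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)] for i in range(n)]
    weight = math.exp(-q * t)          # Poisson probability of k = 0 jumps
    accumulated = weight
    v = list(p0)                       # p0 * P^k, starting with k = 0
    result = [weight * x for x in v]
    k = 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k            # Poisson probability of k jumps
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

# Hypothetical two-state model: state 0 = up, state 1 = down.
LAM, MU = 1e-3, 1e-1
Q = [[-LAM, LAM], [MU, -MU]]
p_at_10 = uniformization(Q, [1.0, 0.0], t=10.0)
```

Every term in the series is a product of non-negative numbers, which is why the method is numerically well behaved; its cost grows with q t, which is the reason stiff models need the special treatment discussed later in the talk.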
51
MODEL SPECIFICATION LANGUAGES
Different languages can be used to specify a single model type: SAVE, QNAP, SPNP all appear very different; underlying model type is Markov
Same language can be used to specify different model types: RESQ input language used for PFQN or EQN
52
LARGENESS AVOIDANCE
Non-State-Space methods
Reliability block diagrams
Fault-trees
Product-Form Queuing Networks
Approximate solutions
State Truncation
SAVE, SPNP, ASSIST (Kantz and Trivedi: PNPM91)
53
LARGENESS AVOIDANCE
Approximate solutions
Hierarchical Decomposition (Chapter 11) and Fixed-Point Iteration among submodels:
Heidelberger and Trivedi; IEEE-TC, 1983 (Queueing Models)
Ciardo and Trivedi; PNPM91 (SPN Models)
Tomek and Trivedi (Availability Models)
Singhal (IEEE-TPDS, 1992)
Chapter 11 of Sahner et al.
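The mechanics of fixed-point iteration among submodels can be sketched with a deliberately crude example invented here for illustration (it is not taken from any of the cited papers). Two subsystems, A and B, share one repair person. Each is modeled as an isolated two-state Markov chain, and the coupling is approximated by slowing each subsystem's repair rate by the probability that the other subsystem is down and occupying the repair person. Each submodel's output is the other's input, so the pair is solved iteratively.

```python
def unavailability(lam, mu_eff):
    """Two-state Markov submodel: steady-state probability of being down."""
    return lam / (lam + mu_eff)

def fixed_point(lam_a, mu_a, lam_b, mu_b, tol=1e-12, max_iter=1000):
    """Iterate the two submodels until their exchanged values stop changing."""
    u_a, u_b = 0.0, 0.0  # initial guess: both subsystems always up
    for iteration in range(1, max_iter + 1):
        # Submodel A sees a repair rate reduced by B's unavailability, and vice versa.
        new_u_a = unavailability(lam_a, mu_a * (1.0 - u_b))
        new_u_b = unavailability(lam_b, mu_b * (1.0 - new_u_a))
        change = max(abs(new_u_a - u_a), abs(new_u_b - u_b))
        u_a, u_b = new_u_a, new_u_b
        if change < tol:
            return u_a, u_b, iteration
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical rates (per hour).
U_A, U_B, ITERATIONS = fixed_point(1e-3, 1e-1, 2e-3, 5e-2)
```

The appeal of the approach is that no joint state space is ever built: only the small submodels are solved, repeatedly. Whether the iteration converges, and to a unique point, has to be argued for each decomposition.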
54
LARGENESS AVOIDANCE
Approximate solutions
Time-Scale Decomposition
Bobbio and Trivedi (IEEE-TC, 1986); Section 11.2
Fluid Approximation:
Mitra; Kulkarni; Ciardo; Nicol, and Trivedi; FSPN
Performability (Chapters 6 and 12)
55
Difficulties in Modeling Using MRMs
Stiffness
Causes numerical difficulties in solution
Stiffness Tolerance
Develop stiffness tolerant numerical
solution methods
Stiffness Avoidance
Avoid generating stiff models through
decomposition
56
STIFFNESS TOLERANCE
Automatic Detection of Stiffness (HARP)
Special Stable ODE Solver
Reibman and Trivedi (TR-BDF2)
Computers and Operations Research, 1988.
Malhotra and Trivedi (Pade, Implicit RK)
57
STIFFNESS TOLERANCE
Uniformization for Stiff Markov Chains
Muppala and Trivedi
We can solve models with rate ratios of 10^8 or higher
Implemented in SHARPE & SPNP
58
STIFFNESS AVOIDANCE
Model-level decomposition
Behavioral Decomposition (HARP, Bobbio & Trivedi): Fault-Occurrence vs. Fault/Error Handling
Hierarchical Composition (SHARPE): Composition of Submodel solutions without generating a single one-level overall model
Fixed-Point Iteration (Ciardo and Trivedi; SPNP)
59
Non-Exponential Behavior
Non state space models: Fault Trees, Reliability
Graphs, RBDs; no problem
60
Non-Exponential Behavior in State Space Models
61
NON-EXPONENTIAL DISTRIBUTIONS
Phase-Type Expansions
Malhotra and Reibman (GSHARPE)
See Figure 9.38 on p. 191 (Red Book)
Non-Homogeneous Markov Chains
CARE III, HARP
Soft Reliability model with imperfect repairs
solved using SHARPE
62
NON-EXPONENTIAL DISTRIBUTIONS
Semi-Markov Chains
Ciardo et al., IEEE-TC, Oct. 90
Markov Regenerative Processes:
Choi, Logothetis, Kulkarni, Trivedi
DSPN and MRSPN:
Choi, Kulkarni, Trivedi
Discrete-Event Simulation
Now in SPNP (FSPN and Non-Markovian SPN Simulation), RESQ, QNAP
63
BELIEVABILITY/UNDERSTANDABILITY
Integration of Measurements and Models
Measurements Provide Parameters to Models
Models Provide Guidelines For Measurements
Models Validated Against Measurements
Integration of Different Modeling Tools
Boeing SDM project
IDEAS project at Duke
64
BELIEVABILITY/UNDERSTANDABILITY
Many Case-Studies of Validations Needed
Vaxcluster Availability Model: Wein & Sathaye
Hsueh, Iyer and Trivedi; IEEE-TC, Apr. 1988
AT & T Validation of ESS
Technology Transfer
Seminars and Workshops
Development and Dissemination of Tools
Application of the Techniques and Tools
65
MODELING AND MEASUREMENTS: INTERFACES
Measurements supply Input Parameters to Models
(Model Calibration or Parameterization)
Confidence Intervals should be obtained
Boeing, Draper, Union Switch projects
Model Sensitivity Analysis can suggest which
Parameters to Measure More Accurately: Blake,
Reibman and Trivedi: SIGMETRICS 1988.
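As a rough illustration of how sensitivity analysis can rank parameters, the sketch below estimates scaled sensitivities by central finite differences. It is not the method of the cited paper; the model (two independent repairable components in series) and every parameter name and value are assumptions made for the example.

```python
def availability(p):
    """Steady-state availability of two independent repairable components in series."""
    a1 = p["mu1"] / (p["lam1"] + p["mu1"])
    a2 = p["mu2"] / (p["lam2"] + p["mu2"])
    return a1 * a2

def scaled_sensitivity(model, params, name, rel_step=1e-6):
    """Central-difference estimate of (theta / A) * dA/dtheta for one parameter."""
    up, down = dict(params), dict(params)
    h = params[name] * rel_step
    up[name] += h
    down[name] -= h
    derivative = (model(up) - model(down)) / (2.0 * h)
    return params[name] * derivative / model(params)

# Hypothetical failure (lam) and repair (mu) rates, per hour.
PARAMS = {"lam1": 1e-3, "mu1": 1e-1, "lam2": 5e-3, "mu2": 1e-1}
SENSITIVITY = {name: scaled_sensitivity(availability, PARAMS, name) for name in PARAMS}
ranking = sorted(PARAMS, key=lambda name: abs(SENSITIVITY[name]), reverse=True)
```

In this example the parameters of the less available component come first in the ranking, which suggests spending measurement effort on them before the others.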
66
MODELING AND MEASUREMENTS: INTERFACES
Model Validation
1. Face Validation
2. Input-Output Validation
3. Validation of Model Assumptions
(Hypothesis Testing)
Rejection of a hypothesis regarding model assumption
based on measurement data leads to an improved model
67
MODELING AND MEASUREMENTS: INTERFACES
Model Structure Based on Measurement Data
Hsueh, Iyer and Trivedi; IEEE TC, April 1988;
Gokhale et al, IPDS 98