Lecture 4: State-Based Methods
CS 7040
Trustworthy System Design, Implementation, and Analysis
Spring 2015, Dr. Rozier
Adapted from slides by WHS at UIUC
Introduction
Availability: A motivation for State-Based Methods
• Recall that availability quantifies the alternation between proper and improper service.
– A(t) is 1 if service is proper, 0 otherwise.
– E[A(t)] is the probability that the service is proper at time t.
– A(0,t) is the fraction of time the system delivers proper service during [0,t].
Availability
• For many systems, availability is a more “user-oriented” measure than reliability.
– It is often more difficult to compute, since it must account for repair and/or replacement.
Availability: Example
• A radio transmitter has an expected time to failure of 500 days. Replacement takes an average of 48 hours.
• A cheaper, and less reliable, transmitter has an expected time to failure of 150 days. Due to cost, a replacement is readily available and replacement takes 8 hours on average.
• Reliability:
– for the first transmitter, MTTF = 500 days
– for the second transmitter, MTTF = 150 days
The first transmitter is 3.33x more reliable.
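• Availability, taken as MTTF / (MTTF + mean replacement time) with times in days: $A = \frac{500}{500 + 2} \approx 0.9960$ for the more reliable transmitter, and $A = \frac{150}{150 + 1/3} \approx 0.9978$ for the less reliable transmitter.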
Higher reliability doesn’t necessarily mean higher availability!
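The comparison can be checked with a few lines of Python. This is a minimal sketch that assumes the steady-state formula availability = MTTF / (MTTF + mean replacement time); the function name is illustrative and not part of the lecture.

```python
def steady_state_availability(mttf_days, mttr_hours):
    """Long-run fraction of time in proper service: MTTF / (MTTF + MTTR)."""
    mttr_days = mttr_hours / 24.0  # convert the replacement time to days
    return mttf_days / (mttf_days + mttr_days)

# Numbers taken from the transmitter example.
reliable = steady_state_availability(500, 48)  # MTTF 500 days, 48 h replacement
cheap = steady_state_availability(150, 8)      # MTTF 150 days, 8 h replacement

print(f"more reliable transmitter: {reliable:.5f}")
print(f"less reliable transmitter: {cheap:.5f}")
print("cheaper transmitter is more available:", cheap > reliable)
```

The cheaper transmitter wins on availability because its replacement time shrinks by a larger factor than its time to failure.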
Availability Modeling using Combinatorial Methods
• Availability modeling can be done with combinatorial methods, but only under the assumptions of independent repairs and exponential lifetime distributions.
• This uses the theory of ON/OFF processes.
The “On” time distribution is the reliability distribution, obtained using combinatorial methods. We represent its mean as E[On].
The “Off” time distribution is the repair time distribution. Its mean is E[Off].
Availability Modeling using Combinatorial Methods
Availability is the fraction of the time in the On state.
• Asymptotically, if instances of On periods are independent and identically distributed (i.i.d.) and instances of Off periods are i.i.d., then:
$A = \frac{E[\mathrm{On}]}{E[\mathrm{On}] + E[\mathrm{Off}]}$
Availability Modeling using Combinatorial Methods
• Lots of assumptions that may not hold!
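The long-run On/Off ratio E[On] / (E[On] + E[Off]) can be checked by simulation. The sketch below assumes exponentially distributed On and Off periods purely for convenience; the means reuse the cheaper transmitter's numbers, and the function name and seed are illustrative.

```python
import random

def simulate_on_off(mean_on, mean_off, cycles, seed=0):
    """Simulate alternating i.i.d. On/Off periods; return the fraction of time On."""
    rng = random.Random(seed)
    time_on = 0.0
    time_off = 0.0
    for _ in range(cycles):
        # expovariate takes a rate, i.e. 1 / mean
        time_on += rng.expovariate(1.0 / mean_on)
        time_off += rng.expovariate(1.0 / mean_off)
    return time_on / (time_on + time_off)

mean_on = 150.0        # days until failure
mean_off = 8.0 / 24.0  # days to replace
estimate = simulate_on_off(mean_on, mean_off, cycles=100_000)
predicted = mean_on / (mean_on + mean_off)

print(f"simulated availability: {estimate:.5f}")
print(f"E[On]/(E[On]+E[Off]):   {predicted:.5f}")
```

If failures or repairs depend on one another, the periods are no longer i.i.d. and this simple ratio is no longer justified, which is the motivation for state-based methods.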
State-Based Methods
• More accurate modeling can be achieved with state-based methods.
– State-based methods relax the independence assumptions needed for combinatorial modeling!
– Failures need not be independent. Failure of one component may make another component more or less likely to fail!
– Repairs need not be independent. Repair and replacement strategies are an important component that must be modeled in high-availability systems.
– High-availability systems may operate in a degraded mode. In a degraded mode, the system may deliver only a fraction of its services. The repair process may start only when a system is sufficiently degraded.
Repairs and Independence
NCSA’s Blue Waters
• 288 cabinets
• 26.4 PB of usable storage
• Let’s say the MTTF of a disk is 3 years.
• 2 TB per disk
• > 27,000 TB of usable storage
• ~ 34,000 TB of actual storage (RAID)
… 17,000 disks
Rate of failure for ANY disk: 17,000 disks / (3 × 365 days) ≈ 15.5 failures per day
NCSA’s Blue Waters
• We are losing 15+ drives every day.
• But loss of a drive isn’t necessarily loss of data, right?
– 17,000 drives in 8+2 means 1,700 RAID groups.
– Chance of RAID failure on day 1 is only 6%.
– 24% on day 2.
• Rather than replace every drive immediately when it fails, let’s replace many drives all at once.
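The arithmetic behind these bullets can be sketched as follows. The fleet failure rate uses only the stated numbers (17,000 disks, 3-year MTTF). The slides do not say how the 6% and 24% figures were derived, so the second function is an assumption: it computes the chance that at least two failed drives land in the same 8+2 group when failures hit drives uniformly at random and nothing is replaced, which gives values of a similar size. All function names are illustrative.

```python
def fleet_failures_per_day(num_disks, mttf_years):
    """Aggregate failure rate of the fleet, assuming each disk fails at rate 1/MTTF."""
    return num_disks / (mttf_years * 365.0)

def prob_two_failures_share_group(num_groups, group_size, num_failed):
    """P[at least one group holds >= 2 failed drives], failures uniform over drives.

    Assumption for illustration only: the lecture does not define "RAID failure".
    """
    total = num_groups * group_size
    p_all_distinct = 1.0
    for i in range(num_failed):
        # i groups are already hit; the next failure must fall outside all of them.
        p_all_distinct *= (total - group_size * i) / (total - i)
    return 1.0 - p_all_distinct

rate = fleet_failures_per_day(17_000, 3)
day1 = prob_two_failures_share_group(1_700, 10, round(rate))      # failures after 1 day
day2 = prob_two_failures_share_group(1_700, 10, round(2 * rate))  # failures after 2 days

print(f"failures per day: {rate:.1f}")
print(f"shared-group chance after day 1: {day1:.1%}")
print(f"shared-group chance after day 2: {day2:.1%}")
```

Under this reading the risk grows roughly with the square of the number of unreplaced drives, which is why a batched replacement policy couples repairs and breaks the independence assumption.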
State-Based Methods
• We use random processes to model these systems.
• We use “state” to “remember” the conditions leading to dependencies.
Random Processes
• A random process is a collection of random variables indexed by time.
– Useful for characterizing the behavior of real systems.
• X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that time (T) = {1, 2, 3, …}
P[X(2) = 12] = 1/36
P[X(3) = 14 | X(1) = 2] = 1/36
E[X(n)] = 3.5n
Etc.
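These three values can be confirmed by brute-force enumeration of all equally likely roll sequences. The sketch below uses exact fractions; the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def prob(event, num_rolls):
    """Exact probability of `event` over all equally likely sequences of rolls."""
    outcomes = list(product(FACES, repeat=num_rolls))
    hits = sum(1 for rolls in outcomes if event(rolls))
    return Fraction(hits, len(outcomes))

# P[X(2) = 12]: the running sum after two rolls is 12.
p_x2_is_12 = prob(lambda r: sum(r) == 12, 2)

# P[X(3) = 14 | X(1) = 2] = P[X(3) = 14 and X(1) = 2] / P[X(1) = 2]
p_joint = prob(lambda r: r[0] == 2 and sum(r) == 14, 3)
p_cond = p_joint / prob(lambda r: r[0] == 2, 3)

# E[X(3)], to compare against 3.5 * 3
expected_x3 = sum(Fraction(sum(r), 6 ** 3) for r in product(FACES, repeat=3))

print(p_x2_is_12, p_cond, expected_x3)
```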
Random Processes
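• If X is a random process, X(t) is a random variable.
• A random variable Y is a function that maps elements of the sample space $\Omega$ to real values, $Y: \Omega \to \mathbb{R}$.
• A random process X maps elements in the two-dimensional space $\Omega \times T$ to elements in $\mathbb{R}$, $X: \Omega \times T \to \mathbb{R}$.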
Random Processes
• A sample path of X is the history of sample space values X adopts as a function of time.
• When we fix t, X becomes a function of $\omega$ alone, $\Omega \to \mathbb{R}$, i.e. a random variable.
• When we fix $\omega$, X becomes a function of t alone, $T \to \mathbb{R}$.
• By fixing $\omega$ (e.g., the system is available) and observing X as a function of t, we see a trajectory (sample path) of the process.
Describing a Random Process
• Recall: for a random variable X, we can use the cumulative distribution to describe the random variable.
• In general, no such simple description exists for a random process.
Describing a Random Process
• A random process can often be described succinctly in several ways. For example, let Y be a random variable representing the roll of a die, and let X(t) be the sum after t rolls. We can describe X(t) as:
X(t) – X(t-1) = Y
P[X(t) = i | X(t-1) = j] = P[Y = i – j],
or $X(t) = Y_1 + Y_2 + \dots + Y_t$, where each $Y_i$ is an independent copy of Y.
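The one-step description P[X(t) = i | X(t-1) = j] = P[Y = i – j] is enough to compute the whole distribution of X(t) by repeated summation. A minimal sketch, assuming the sum starts from X(0) = 0 (a convention added here so the recursion has a starting point); function names are illustrative.

```python
from fractions import Fraction

# Distribution of a single fair die roll Y.
DIE = {face: Fraction(1, 6) for face in range(1, 7)}

def next_pmf(pmf_prev):
    """P[X(t) = i] = sum over j of P[X(t-1) = j] * P[Y = i - j]."""
    pmf = {}
    for j, p_j in pmf_prev.items():
        for y, p_y in DIE.items():
            pmf[j + y] = pmf.get(j + y, Fraction(0)) + p_j * p_y
    return pmf

def pmf_of_sum(t):
    """Distribution of X(t), starting from X(0) = 0 with probability 1."""
    pmf = {0: Fraction(1)}
    for _ in range(t):
        pmf = next_pmf(pmf)
    return pmf

pmf2 = pmf_of_sum(2)
print("P[X(2) = 7]  =", pmf2[7])
print("P[X(2) = 12] =", pmf2[12])
```

Each call to `next_pmf` needs only the previous distribution, not the full history of rolls.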
Classifying a Random Process: Characteristics of T
• If the number of points defined by a random process, i.e. |T|, is finite or countable, then the random process is said to be a discrete-time random process.
• If |T| is uncountable, then the random process is said to be a continuous-time random process.
Let X(t) be the number of fault arrivals in a system up to time t. What type of process is X(t)?
Classifying a Random Process: State Space Type
• Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e. $S = \{ y : X(t) = y \text{ for some } t \in T \}$.
If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.
Classifying a Random Process: State Space Type
• If the state space, S, of a random process, X, is finite or countable, then X is said to be a discrete-state random process.
• If the state space, S, of a random process, X, is uncountable, then X is said to be a continuous-state random process.
Classifying a Random Process: State Space Type
• Let X be a random process that represents the number of bad packets received over a network.
– Classify the state space.
Classifying a Random Process: State Space Type
• Let X be a random process that represents the voltage on a telephone line.
– Classify the state space.
Classifying a Random Process: State Space Type
We will concern ourselves primarily with discrete-state processes.
Random Process Classification Examples
Markov Process
• A special type of random process that we will examine is called a Markov process. A Markov process can be defined, informally, as follows:
Given the state of a Markov process X at time t, the future behavior of X can be described completely in terms of X(t).
Markov processes have the very useful property that, given the current state, their future behavior is independent of past values.
Markov Chains
• A Markov chain is a Markov process with a discrete state space.
We will always make the assumption that a Markov chain has a state space in {1, 2, …} and that it is time-homogeneous.
A Markov chain is time-homogeneous if its future behavior does not depend on what time it is, only on the current state.
We formalize this property by looking at a discrete-time Markov chain (DTMC). A DTMC X has the following property:
$P[X(t+k) = j \mid X(t) = i, X(t-1) = n_{t-1}, \dots, X(0) = n_0] = P[X(t+k) = j \mid X(t) = i] = P_{ij}^{(k)}$
DTMCs
• Given i, j, and k, $P_{ij}^{(k)} = P[X(t+k) = j \mid X(t) = i]$ is a number.
• $P_{ij}^{(k)}$ can be interpreted as the probability that, if X has value i, then after k time-steps X will have value j.
Frequently we write $P_{ij}$ to mean $P_{ij}^{(1)}$.
DTMC
• Suppose we have a processor that can be either working or failed at a given time t.
DTMC
• Suppose we have two processors, each of which can be working or failed, independently, at a given time t.
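As a sketch of how these two examples might be written down, the code below builds a one-step transition matrix for one processor and then for two independent processors. The per-step failure and repair probabilities are hypothetical values chosen for illustration; they do not come from the lecture. Because the processors are independent, the joint matrix is the Kronecker product of the single-processor matrices.

```python
# Hypothetical per-step probabilities (not from the lecture).
P_FAIL = 0.1    # working -> failed
P_REPAIR = 0.5  # failed -> working

# One processor: state 0 = working, state 1 = failed. Row i holds P[i -> j].
ONE = [
    [1 - P_FAIL, P_FAIL],
    [P_REPAIR, 1 - P_REPAIR],
]

def kron(a, b):
    """Kronecker product: transition matrix of two independent chains run side by side."""
    return [
        [a_ik * b_jl for a_ik in row_a for b_jl in row_b]
        for row_a in a
        for row_b in b
    ]

# Two processors: states ordered (0,0), (0,1), (1,0), (1,1).
TWO = kron(ONE, ONE)

for row in TWO:
    print(["%.2f" % p for p in row])
```

With dependent failures or a shared repair crew the joint matrix would no longer factor this way, but it could still be written directly over the same four states.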
State Occupancy Probability Vector
• Let $\pi$ be a row vector. We denote $\pi_i$ to be the i-th element of the vector. If $\pi(k)$ is a state occupancy probability vector, then $\pi_i(k)$ is the probability that a DTMC is in state i at time-step k.
• Assume that a DTMC X has a state space of size n, i.e. S = {1, 2, …, n}. We say formally:
$\pi_i(k) = P[X(k) = i]$
Note that $\sum_{i=1}^{n} \pi_i(k) = 1$ for all k.
State Occupancy Probability Vector
Computing a single step forward in time.
Given $\pi(0)$, the initial probability vector (where we are likely to be at t=0), and the transition probabilities $P_{ij}$, how do we compute $\pi(1)$?
Recall the definition of $P_{ij}$: $P_{ij} = P[X(k+1) = j \mid X(k) = i]$
State Occupancy Probability Vector
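Since $\pi_j(1) = P[X(1) = j] = \sum_{i=1}^{n} P[X(1) = j \mid X(0) = i] \, P[X(0) = i]$, we obtain $\pi_j(1) = \sum_{i=1}^{n} \pi_i(0) P_{ij}$.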
State Occupancy Probability Vector
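• We have $\pi_j(1) = \sum_{i=1}^{n} \pi_i(0) P_{ij}$, which holds for all j.
• This resembles vector-matrix multiplication. If we arrange the matrix $P = [P_{ij}]$, with $P_{ij}$ in row i and column j,
then $\pi(1) = \pi(0) P$.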
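The single-step update is a row vector times a matrix, which can be written directly in plain Python. The two-state matrix below uses hypothetical probabilities chosen for illustration, and states are numbered from 0 rather than 1 to match Python indexing.

```python
def step(pi, P):
    """Return pi(k+1) = pi(k) P, i.e. pi_j(k+1) = sum_i pi_i(k) * P[i][j]."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain: state 0 = working, state 1 = failed.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi0 = [1.0, 0.0]    # start in the working state with probability 1
pi1 = step(pi0, P)  # occupancy after one time-step
pi2 = step(pi1, P)  # occupancy after two time-steps

print("pi(1) =", pi1)
print("pi(2) =", pi2)
```

Each result is again a probability vector, because every row of P sums to 1.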
State Occupancy Probability Vector
• The important consequence of this is that we can easily specify a DTMC in terms of an initial state occupancy probability vector $\pi(0)$ and a transition probability matrix P.
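Since the pair of initial vector and transition matrix specifies the chain, both the occupancy vector at any later step and individual sample paths can be generated from it. The sketch below assumes a hypothetical two-state chain with states numbered from 0; the function names and the seed are illustrative.

```python
import random

def occupancy_at(pi0, P, k):
    """pi(k), obtained by applying the update pi <- pi P a total of k times."""
    pi = list(pi0)
    n = len(pi)
    for _ in range(k):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def sample_path(pi0, P, steps, rng):
    """Draw X(0) from pi0, then draw each next state from the current state's row of P."""
    states = range(len(pi0))
    path = [rng.choices(states, weights=pi0)[0]]
    for _ in range(steps):
        path.append(rng.choices(states, weights=P[path[-1]])[0])
    return path

# Hypothetical chain: state 0 = working, state 1 = failed.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi0 = [1.0, 0.0]

rng = random.Random(0)
print("pi(2)       =", occupancy_at(pi0, P, 2))
print("sample path =", sample_path(pi0, P, 10, rng))
```

Averaging the state at step k over many sample paths should approach the corresponding entry of `occupancy_at(pi0, P, k)`.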
Transient Behavior of DTMCs
A Simple Example
• Suppose the weather in Cincinnati can be modeled the following way:
Simple Example, cont.
Simple Example, Solution
Solution, cont.
Graphical Representation
Simple Computer Example
Limiting Behavior of DTMCs
For next time
• Apply for a Mobius account
• Use your uc.edu e-mail address, and specifically mention this class and instructor.