lecture 4: state-based methods cs 7040 trustworthy system design, implementation, and analysis...

Lecture 4: State-Based Methods

CS 7040

Trustworthy System Design, Implementation, and Analysis

Spring 2015, Dr. Rozier

Adapted from slides by WHS at UIUC

Introduction

Availability: A motivation for State-Based Methods

• Recall that availability quantifies the alternation between proper and improper service.– A(t) is 1 if service is proper, 0 otherwise.– E[A(t)] is the probability that the service is proper

at time t.– A(0,t) is the fraction of time the system delivers

proper service during [0,t]

Availability

• For many systems, availability is a more “user-oriented” measure than reliability.– It is often more difficult to compute, since it must

account for repair and/or replacement.

Availability: Example

• A radio transmitter has an expected time to failure of 500 days. Replacement takes an average of 48 hours.

• A cheaper, and less reliable, transmitter has an expected time to failure of 150 days. Due to cost, a replacement is readily available and replacement takes 8 hours on average.




• Reliability, – for the first transmitter MTTF = 500 days– for the second transmitter MTTF = 150 days

First transmitter is 3.33x more reliable.




• For for the more reliable transmitter, and for the less reliable transmitter.




• For for the more reliable transmitter, and for the less reliable transmitter.

Higher reliability doesn’t necessarily mean higher availability!

Availability Modeling using Combinatorial Methods

• Availability modeling can be done with combinatorial methods, but only with independent repair assumptions, and exponential life-time distributions.

• This uses the theory of ON/OFF processes

“On” time distribution is reliability distribution, obtained using combinatorial methods. We represent mean as E[On].

“Off” distribution is the repair time distribution. Mean is E[Off].


Availability is the fraction of the time in the On state.

• Asymptotically, if instances of On periods are independent and identically distributed (i.i.d.) and instances of Off periods are i.i.d. then:


• Lots of assumptions that may not hold!

State-Based Methods• More accurate modeling can be achieved with state-based methods.

– State-based methods relax the independence assumptions needed for combinatorial modeling!

– Failures need not be independent. Failure of one component may make another component more or less likely to fail!

– Repairs need not be independent. Repair and replacement strategies are an important component that must be modeled in high-availability systems.

– High availability systems may operate in a degraded mode. In a degraded mode, the system may deliver only a fraction of its services. The repair process may start only when a system is sufficiently degraded.

Repairs and Independence

NCSA’s Blue Waters

• 288 cabinets• 26.4 PB of usable storage


• 288 cabinets• 26.4 PB of usable storage

• Let’s say the MTTF of a disk is 3 years.


• 26.4 PB of usable storage• Let’s say the MTTF of a disk is 3 years.• 2TB per disk


• 26.4 PB of usable storage• Let’s say the MTTF of a disk is 3 years.• 2TB per disk• > 27,000 TB of usable storage• ~ 34,000 TB of actual storage (RAID)



… 17,000 disks



… 17,000 disks Rate of failure for ANY disk


• We are losing 15+ drives every day.

• But loss of a drive isn’t necessarily loss of data, right?– 17,000 drives in 8+2 means 1,700 RAID groups.– Chance of RAID failure on day 1 is only 6%.– 24% on day 2

• Rather than replace every driveimmediately when it fails, let’sreplace many drives all atonce.

State-Based Methods

• We use random processes to model these systems.

• We use “state” to “remember” the conditions leading to dependencies.

Random Processes

• A random process is a collection of random variables indexed by time.– Useful for characterizing the behavior of real

systems.

• X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that time (T) = {1, 2, 3, …}

Random Processes

• X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that time (T) = {1, 2, 3, …}

P[X(2) = 12] = 1/36P[X(3) = 14|X(1) = 2] = 1/36E[X(n)] = 3.5n

Etc.

Random Processes

• If X is a random process, X(t) is a random variable

• A random variable Y is a function that maps

• A random process X maps elements in the two dimensional space to elements in

Random Processes

• A sample path of X is the history of sample space values X adopts as a function of time.

• When we fix t, then X becomes a function • When we fix X becomes a function• By fixing (e.g., the system is available) and observing X as a

function of T, we see a trajectory of the process samplingor not.

Describing a Random Process

• Recall: for a random variable X, we can use the cumulative distribution to describe the random variable.

• In general, no such simple description exists for a random process.

Describing a Random Process

• A random process can often be described succinctly in various ways, for example, say Y is a random variable representing the roll of a die, and X(t) is the sum after t rolls. We can describe X(t) as:

X(t) – X(t-1) = YP[X(t) = i|X(t-1) = j] = P[Y = i – j],

Or where eachIs independent.

Classifying a Random Process:Characteristics of T

• If the number of points defined by a random process, i.e. |T|, is finite or countable then the random process is said to be a discrete-time random process.

• If |T| is uncountable then random process is said to be a continuous-time random process.

Let X(t) be the number of fault arrivals in a system up to time t. What time of process is X(t)?

Classifying a Random Process:State Space Type

• Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e.

If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.


• If the state space, S, of a random process, X, is finite or countable, then X is said to be a discrete-state random process.

• If the state space S of a random process X is infinite and uncountable then X is said to be a continuous-state random process.


• Let X be a random process that represents the number of bad packets received over a network. – Classify the state space.


• Let X be a random process that represents the voltage on a telephone line. – Classify the state space.


We will concern ourselves primarily with discrete-state processes.

Random Process Classification Examples

Markov Process

• A special type of random process that we will examine is called a Markov process. A Markov process can be defined, informally, as follows:

Given the state of a Markov process X at time t, the future behavior of X can be described completely

in terms of X(t).

Markov processes have the very useful property that their future behavior is independent of past values.

Markov Chains• A Markov chain is a Markov process with a discrete state

space.We will always make the assumption that a Markov chain has a

state space in {1, 2, …} and that it is time-homogeneous.

A Markov chain is time-homogenous if its future behavior does not depend on what time it is, only the current state.

We formalize this property by looking at a discrete-time Markov chain (DTMC). A DTMC X has the following property:

DTMCs

• Given i, j, and k, is a number.• can be interpreted as the probability that

X has value i, then after k time-steps, X will have value j.

Frequently we write to mean

DTMC

• Suppose we have a processor, it can be working or failed at some time T.

DTMC

• Suppose we have two processors, which can both be working or failed, independently, at time T.

State Occupancy Probability Vector• Let be a row vector. We denote to be the i-th

element of the vector. If is a state occupancy probability vector, thenis the probability that a DTMC is in state i at time-step k.

• Assume that a DTMC X has a state-space of size n, i.e. S = {1, 2, …, n}. We say formally:

Note that

State Occupancy Probability Vector

Computing a single step forward in time.

Given an initial which is the initial probability vector (where we are likely to be at t=0), andhow do we compute ?

Recall the definition of


Since


• We have which holds for all j.

• This resembles vector-matrix multiplication. If we arrange the matrix

Then


• The important consequence of this is that we can easily specify a DTMC in terms of an occupancy probability vector, and a transition probability matrix P.

Transient Behavior of DTMCs

A Simple Example

• Suppose the weather in Cincinnati can be modeled the following way:

Simple Example, cont.

Simple Example, Solution

Solution, cont.

Graphical Representation

Simple Computer Example

Limiting Behavior of DTMCs

For next time

• Apply for a Mobius account• Use your uc.edu e-mail

address, specifically mention this class and instructor.

lecture 4: state-based methods cs 7040 trustworthy system design, implementation, and analysis...

Documents

reliable transmitter

expected time

transmitter mttf

daysfirst transmitter

examplea radio transmitter

time t

fraction of time

repair time distribution