RESISTing Reliability Degradation through Proactive Reconfiguration
D. Cooray, S. Malek, R. Roshandel, and D. Kilgore
Summarized by Haoliang Wang
September 28, 2015
Motivation
An emerging class of systems: situated software systems
◦ Predominantly pervasive, embedded, and mobile
◦ The software is subject to dynamic contextual changes
◦ Most applications, such as emergency response, are mission-critical, so reliability matters
Reliability analysis at design time is insufficient
◦ System reliability (and other QoS attributes) depends on runtime characteristics
◦ Adaptation at runtime is necessary
Adaptation using a reactive approach
◦ Adapts to changes only after degradation has occurred, which is not good enough
◦ Prediction-based proactive adaptation is preferred
Challenges
§ Proactively reconfigure the system before performance degrades
§ Effectively estimate the reliability of a complex system at runtime
§ Determine the optimal system architecture at runtime
RESIST Framework
Resilient Situated Software System
◦ Component-level Reliability Analyzer
◦ Configuration Reliability Analyzer
◦ Configuration Selector
Context-Aware Middleware
◦ Provides support for execution, monitoring, and adaptation of a software system
RESIST Framework (Cont.)
RESIST is a Goal Management layer solution in the three-layer architectural model for self-managed systems
RESIST Framework (Cont.)
System Model
◦ The system is divided into functional components, each with its own reliability
◦ Each component is allocated to a process
◦ System reliability is determined by the architecture, the individual components, and the context
Failure Model
◦ Fail-stop: failures are detectable by middleware facilities
◦ Component failure: effects are contained within the boundary of the component
◦ Process failure: occurs when one of its components exits prematurely; the other components running in that process also fail
Component-level Analysis
Discrete Time Markov Chain (DTMC)
◦ Used to estimate the component reliability
◦ A stochastic process with a set of states S = {S1, S2, S3, …, SN}
◦ Transition matrix A = {aij}, where aij is the probability of transitioning from Si to Sj
◦ The reliability of the component is computed by solving for the steady-state probability of not being in any failure state
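The steady-state computation described above can be sketched in a few lines. This is only an illustration: the 3-state matrix, the `FAILED` index, and the helper name `steady_state` are assumptions made for the example, not values or code from the paper. The failed state is given a recovery transition so that a unique steady state exists.

```python
import numpy as np

def steady_state(A):
    """Return the stationary distribution pi of a row-stochastic matrix A (pi @ A = pi)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Solve (A^T - I) pi = 0 together with sum(pi) = 1 as one least-squares system.
    lhs = np.vstack([A.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return pi

# Hypothetical component: states 0 and 1 are operational, state 2 is "failed".
# The failed state always recovers to state 0 (an assumption for this sketch).
A = np.array([
    [0.90, 0.09, 0.01],
    [0.50, 0.49, 0.01],
    [1.00, 0.00, 0.00],
])
FAILED = 2

pi = steady_state(A)
reliability = 1.0 - pi[FAILED]  # probability of not being in the failure state
print(pi, reliability)
```

The component reliability is read off as one minus the steady-state mass of the failure state, which mirrors the definition in the last bullet.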
How to derive the transition matrix A?
Component-level Analysis (Cont.)
Hidden Markov Models (HMMs)
◦ Learn from runtime data to estimate the transition probability matrix
◦ A stochastic process with a set of states S = {S1, S2, S3, …, SN}
◦ Transition matrix A = {aij}, where aij is the probability of transitioning from Si to Sj
◦ A set of observations O = {O1, O2, O3, …, OM}
◦ Observation matrix E = {eik}, where eik is the probability of observing event Ok in state Si
The Baum-Welch algorithm is used to train the HMM and obtain the converged transition matrix A
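To make the training step concrete, here is a minimal sketch of Baum-Welch re-estimation for a discrete HMM with one observation sequence, using scaled forward and backward passes. It is not the authors' MATLAB implementation; the function name, the random observation log, and the initial matrices `A0`, `E0`, `pi0` are all assumptions for illustration.

```python
import numpy as np

def baum_welch(obs, A, E, pi, n_iter=20):
    """Re-estimate (A, E, pi) from one sequence of observation indices in [0, M)."""
    obs = np.asarray(obs)
    A, E, pi = (np.array(m, dtype=float) for m in (A, E, pi))
    T, N = len(obs), A.shape[0]
    log_likelihoods = []
    for _ in range(n_iter):
        # Forward pass, rescaled at every step to avoid numerical underflow.
        alpha = np.zeros((T, N))
        scale = np.zeros(T)
        alpha[0] = pi * E[:, obs[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * E[:, obs[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]
        # Backward pass, reusing the forward scaling factors.
        beta = np.ones((T, N))
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (E[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        # Posterior state probabilities (gamma) and transition probabilities (xi).
        gamma = alpha * beta
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (E[:, obs[1:]].T * beta[1:])[:, None, :]
              / scale[1:, None, None])
        # Log-likelihood of the data under the parameters used in this pass.
        log_likelihoods.append(np.log(scale).sum())
        # Re-estimation step.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(E.shape[1]):
            E[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return A, E, pi, log_likelihoods

# Hypothetical runtime log with M = 3 event types and N = 2 hidden states.
rng = np.random.default_rng(0)
obs = rng.integers(0, 3, size=200)
A0 = [[0.7, 0.3], [0.4, 0.6]]
E0 = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
pi0 = [0.6, 0.4]
A_hat, E_hat, pi_hat, ll = baum_welch(obs, A0, E0, pi0)
```

Each iteration keeps the rows of `A_hat` and `E_hat` stochastic and does not decrease the log-likelihood, which is the sense in which the transition matrix "converges".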
Component-level Analysis (Cont.)
An example of estimating component reliability
◦ A behavior model of a robot controller
◦ States S = {idle, estimating, planning, moving, failed}
◦ Running the Baum-Welch algorithm on the observation sequence yields the transition matrix A
◦ Solving for the steady-state probability vector gives [0.1966, 0.2238, 0.3849, 0.1914, 0.0033]
◦ The controller component reliability is 1 - 0.0033 = 99.67%
Component-level Analysis (Cont.)
Estimating the near future by incorporating the context
◦ Define a set of contextual parameters C = {C1, C2, …, Cx}
◦ If akj, the transition probability from state Sk to state Sj in matrix A, is affected by changes in a specific contextual parameter Cn, then a’kj = μ(akj, ΔCn), where μ is a context-specific function quantifying the impact of the contextual change on the transition probability
◦ The remaining transition probabilities in the row are adjusted proportionately such that a’kj + akf + Σa’km = 1
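The row adjustment above can be sketched as follows. The sketch reads akf as an entry that keeps its original value while the other entries are rescaled, which is an interpretation of the constraint rather than something the slides state. The linear `mu`, its `sensitivity` parameter, and the example row are hypothetical.

```python
import numpy as np

def adjust_row(row, j, new_value, fixed=()):
    """Set row[j] to new_value and rescale the other free entries so the row sums to 1.

    Entries whose indices are listed in `fixed` (for example a_kf) keep their value.
    """
    row = np.asarray(row, dtype=float)
    out = row.copy()
    out[j] = new_value
    free = [m for m in range(len(row)) if m != j and m not in fixed]
    budget = 1.0 - new_value - sum(row[f] for f in fixed)
    old_mass = row[free].sum()
    if budget < 0 or old_mass <= 0:
        raise ValueError("no probability mass left for the remaining transitions")
    # Proportional adjustment: the free entries keep their relative ratios.
    out[free] = row[free] * (budget / old_mass)
    return out

def mu(a_kj, delta_c, sensitivity=0.5):
    # Hypothetical context function: linear in the context change, clipped to [0, 1].
    return float(np.clip(a_kj + sensitivity * delta_c, 0.0, 1.0))

# Hypothetical row of A for state S_k; index 3 plays the role of a_kf.
row = np.array([0.10, 0.60, 0.25, 0.05])
new_row = adjust_row(row, j=1, new_value=mu(row[1], delta_c=0.2), fixed=(3,))
print(new_row)
```

Because only the row for the affected state changes, the adjusted matrix can be fed straight back into the steady-state computation to predict near-future reliability.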
Configuration-level Analysis
Markov-based system-level reliability estimation
◦ System reliability is estimated compositionally from the reliability of the individual components
◦ The components and the interactions between them are mapped into a DTMC, where a state is one or more components in concurrent execution
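◦ System reliability is computed as
$$R = (-1)^{k+1} \frac{R_k \, E}{|I - M|}$$
where $M$ is a $k \times k$ matrix whose elements are
$$M(i, j) = \begin{cases} R_i P_{ij}, & s_i \text{ reaches } s_j \text{ and } i \neq k \\ 0, & \text{otherwise} \end{cases}$$
where $R_i$ is the reliability of state $s_i$, $P_{ij}$ is the transition probability from $s_i$ to $s_j$, and $E$ is the determinant of the matrix that remains after excluding the last row and the first column of $(I - M)$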
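As a rough illustration of the formula, the sketch below builds M from a transition matrix and per-state reliabilities and evaluates R with two determinants. The 3-state chain and its numbers are made up for the example; "s_i reaches s_j" is encoded simply as a nonzero P_ij.

```python
import numpy as np

def system_reliability(P, R):
    """Reliability of starting in s_1 and completing the final state s_k.

    P: (k, k) transition probabilities between states
    R: (k,)   reliability of each state
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    k = len(R)
    M = R[:, None] * P      # M(i, j) = R_i * P_ij
    M[k - 1, :] = 0.0       # rows with i = k are zero by definition
    W = np.eye(k) - M
    # E: determinant of (I - M) with its last row and first column removed.
    E = np.linalg.det(W[:-1, 1:])
    return (-1) ** (k + 1) * R[k - 1] * E / np.linalg.det(W)

# Hypothetical chain: s1 -> s2, then s2 -> s3 (0.8) or back to s1 (0.2).
P = [[0.0, 1.0, 0.0],
     [0.2, 0.0, 0.8],
     [0.0, 0.0, 0.0]]
R = [0.99, 0.98, 1.0]
print(system_reliability(P, R))
```

For this small chain the result can be checked by hand: the probability of eventually reaching s3 without failure is 0.99 · 0.784 / (1 − 0.99 · 0.196), which the determinant form reproduces.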
Configuration-level Analysis (Cont.)
An example of estimating system reliability
◦ Suppose the initial component reliabilities obtained for the Controller and the Navigator are C = 0.9967 and N = 0.9751, and assume the other components are 100% reliable
◦ Based on the observed data, we can obtain the transition probability for each state and therefore M
◦ Solving the model yields a system reliability of 93.85%
Configuration-level Analysis (Cont.)
Impact of architectural style
◦ E.g., replicating components to improve system reliability
Configuration-level Analysis (Cont.)
Impact of deployment architecture
◦ E.g., reallocating components to different processes to improve system reliability
Configuration Selection
Configuration selection as an optimization problem
◦ The optimal configuration in RESIST is defined as one that satisfies the system’s reliability requirement while improving the other quality attributes of concern
◦ In other words, given the decision variables
$p_i \in \mathbb{Z}^+$, the number of replicas for component $i$
$x_{ij} \in \{0, 1\}$, indicating whether component $i$ is placed on process $j$
the objective is to find an architectural configuration $C^*$ such that
$$C^* = \arg\max_{C} \sum_{\forall q \in \text{Quality Objectives}} U_q(C)$$
subject to
$$\forall i \in \{1, \dots, t\}: \; p_i \le w_i, \quad w_i \in \mathbb{Z}^+$$
$$\forall i \in \{1, \dots, t\}: \; \sum_{j=1}^{h} x_{ij} = 1$$
$$R(C) \ge \delta, \quad \delta \in \mathbb{R}, \; 0 < \delta \le 1$$
where $U_q$ is a utility function indicating the preference for quality attribute $q$, and $R(C)$ is the expected reliability of a given architecture $C$
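A minimal sketch of the selection step, assuming the candidate configurations have already been generated so that the replica and placement constraints hold, and that R(C) and each U_q(C) have been evaluated. The `Candidate` class, the objective names, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    reliability: float   # R(C), e.g. from the configuration-level analysis
    utilities: dict      # U_q(C) for each quality objective q

def select_configuration(candidates, delta):
    """Return the candidate with R(C) >= delta and the highest total utility, or None."""
    feasible = [c for c in candidates if c.reliability >= delta]
    if not feasible:
        return None
    return max(feasible, key=lambda c: sum(c.utilities.values()))

# Hypothetical candidates and utilities.
candidates = [
    Candidate("replicated", 0.97, {"latency": 0.4, "energy": 0.3}),
    Candidate("co-located", 0.92, {"latency": 0.9, "energy": 0.8}),
    Candidate("distributed", 0.95, {"latency": 0.6, "energy": 0.5}),
]
best = select_configuration(candidates, delta=0.94)
print(best.name if best else "no feasible configuration")
```

The reliability requirement acts as a hard filter and the utilities only rank what passes it, so the highest-utility candidate loses here because it falls below δ.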
Configuration Selection (Cont.)
Configuration reliability R(C)
◦ Assume a component may either be replicated or share a process with other components. This is expressed with a binary variable $q_i$, which is 1 if the $i$-th component shares a process and 0 otherwise:
$$q_i = 1 - \sum_{j=1}^{h} x_{ij} \prod_{k \neq i}^{t} (1 - x_{kj})$$
◦ Thus, the effective reliability of component $i$ is
$$r_i^{eff} = q_i \, r_i^{share} + (1 - q_i) \, r_i^{rep}$$
where
$$r_i^{share} = \sum_{j=1}^{h} r_i \, x_{ij} \prod_{k \neq i}^{t} \left[ r_k \, x_{kj} + (1 - x_{kj}) \right]$$
$$r_i^{rep} = 1 - (1 - r_i)^{1 + p_i}$$
◦ Finally, the system reliability can be computed as specified in the configuration-level analysis
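The effective-reliability computation can be sketched as below, under two readings of the formulas that should be treated as assumptions: a shared process survives only if every component placed on it survives, and a replicated component has 1 + p_i instances in total. The placement matrix and reliabilities are invented for the example.

```python
import numpy as np

def effective_reliability(x, r, p):
    """x: (t, h) 0/1 placement matrix, r: (t,) reliabilities, p: (t,) replica counts."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    p = np.asarray(p)
    t, h = x.shape
    r_eff = np.zeros(t)
    for i in range(t):
        others = [k for k in range(t) if k != i]
        # alone = 1 if component i has a process to itself, so q_i = 1 - alone.
        alone = sum(x[i, j] * np.prod(1 - x[others, j]) for j in range(h))
        q = 1 - alone
        # Shared process: every co-located component must survive.
        r_share = sum(
            r[i] * x[i, j] * np.prod(r[others] * x[others, j] + (1 - x[others, j]))
            for j in range(h)
        )
        # Replication: the component fails only if all 1 + p_i instances fail.
        r_rep = 1 - (1 - r[i]) ** (1 + p[i])
        r_eff[i] = q * r_share + (1 - q) * r_rep
    return r_eff

# Hypothetical deployment: components 0 and 1 share process 0;
# component 2 runs alone on process 1 with one extra replica.
x = [[1, 0],
     [1, 0],
     [0, 1]]
r = [0.99, 0.98, 0.90]
p = [0, 0, 1]
print(effective_reliability(x, r, p))
```

The two co-located components each end up with 0.99 · 0.98, while the replicated one rises from 0.90 to 1 − 0.1², which shows the trade-off the selector has to weigh.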
Configuration Selection (Cont.)
Time-complexity analysis
◦ Suppose we have
P = number of processes
C = number of components
N = maximum number of replicas
◦ This implies that there are
O(P^C) ways of allocating components to processes
O(N^C) ways of replicating components
◦ Therefore, the total number of possible configurations is O((NP)^C), which grows exponentially with the number of components (an NP problem)
However, the solution space may be pruned significantly by imposing architectural constraints
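A small enumeration makes the count and the effect of pruning tangible. The sketch assumes each component picks one process and one of N replication levels; the cap of two components per process is a made-up architectural constraint used only to show pruning.

```python
from collections import Counter
from itertools import product

def enumerate_configurations(num_components, num_processes, max_replicas):
    """Yield (placement, replicas): one process index and one replica level per component."""
    for placement in product(range(num_processes), repeat=num_components):
        for replicas in product(range(1, max_replicas + 1), repeat=num_components):
            yield placement, replicas

def within_capacity(placement, max_per_process):
    # Hypothetical architectural constraint: limit components per process.
    return max(Counter(placement).values()) <= max_per_process

C, P, N = 3, 2, 2
all_configs = list(enumerate_configurations(C, P, N))
pruned = [cfg for cfg in all_configs if within_capacity(cfg[0], max_per_process=2)]
print(len(all_configs), (N * P) ** C, len(pruned))
```

Even in this tiny case one constraint removes a quarter of the space; with realistic values of C the unpruned count becomes impractical to enumerate.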
Evaluation
Implementation
◦ Mobile emergency response system prototype
◦ XTEAM is used to control the system’s operational profile
◦ Prism-MW is used to gather the runtime data
◦ MATLAB is used to generate and solve the HMM model
Evaluation Criteria
◦ Validity of reliability predictions
◦ Effectiveness of proactive reconfiguration
◦ Performance overhead
Evaluation (Cont.)
Validity of Reliability Prediction
◦ Bump probability is used as the contextual parameter; it affects the transition probability from the moving state to the estimating state
Evaluation (Cont.)
Proactive Reconfiguration
Evaluation (Cont.)
Overhead of Component Reliability Analysis
Summary
RESIST is a framework that maintains the reliability of a situated software system through proactive reconfiguration of the software architecture
Three major components
◦ Component reliability analysis
◦ Configuration reliability analysis
◦ Configuration selector
Three key contributions
◦ Incorporates multiple sources of information, particularly contextual information
◦ Automatically finds the optimal architectural configuration
◦ Proactively adapts the system before its reliability degrades