RESISTing Reliability Degradation through Proactive Reconfiguration D. Cooray, S. Malek, R. Roshandel, and D. Kilgore Summarized by Haoliang Wang September 28, 2015

Motivation

An emerging class of systems: situated software systems
◦ Predominantly pervasive, embedded, and mobile
◦ The software system is subject to dynamic contextual changes
◦ Most applications, such as emergency response, are mission-critical, so reliability matters

Reliability analysis at design time is insufficient
◦ System reliability (and other QoS properties) depends on runtime characteristics
◦ Adaptation at runtime is necessary

Adaptation using a reactive approach
◦ Adapts to changes only after degradation has occurred, which is not good enough
◦ Prediction-based proactive adaptation is preferred

Challenges

§ Proactively reconfigure the system before performance degradation

§ Effectively estimate the reliability of a complex system at runtime

§ Determine the optimal system architecture at runtime

RESIST Framework

Resilient Situated Software System
◦ Component-level Reliability Analyzer
◦ Configuration Reliability Analyzer
◦ Configuration Selector

Context-Aware Middleware
◦ Provides support for execution, monitoring, and adaptation of a software system

RESIST Framework (Cont.)

RESIST is a Goal Management layer solution in the three-layer architectural model for self-managed systems

RESIST Framework (Cont.)

System Model
◦ The system is divided into several functional components, each with its own reliability
◦ Each component is allocated to a process
◦ The system reliability is determined by the architecture, the individual components, and the context

Failure Model
◦ Fail-stop: failures are detectable by middleware facilities
◦ Component failure: effects are contained within the boundary of the component
◦ Process failure: occurs when one of its components exits prematurely; the other components running on that process also fail

Component-level Analysis

Discrete Time Markov Chain (DTMC)
◦ Used to estimate the component reliability
◦ A stochastic process with a set of states $S = \{S_1, S_2, S_3, \dots, S_N\}$
◦ Transition matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the probability of transitioning from $S_i$ to $S_j$
◦ The reliability of the component is computed by solving for the steady-state probability of not being in any failure state
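The steady-state computation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `steady_state` and `component_reliability`, the three-state matrix, and the use of power iteration are not from the slides, and the sketch assumes the failure state is recoverable (not absorbing), since otherwise the steady-state reliability would be zero.

```python
def steady_state(A, tol=1e-12, max_iter=100_000):
    """Power iteration: repeat pi <- pi * A until the vector stops changing."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence; the chain may be periodic or reducible")


def component_reliability(A, failure_states):
    """Reliability = steady-state probability of not being in a failure state."""
    pi = steady_state(A)
    return 1.0 - sum(pi[s] for s in failure_states)


# Hypothetical 3-state component: 0 = idle, 1 = busy, 2 = failed.
A = [
    [0.70, 0.29, 0.01],
    [0.40, 0.58, 0.02],
    [0.90, 0.10, 0.00],  # assumed recovery transitions out of the failed state
]
print(component_reliability(A, failure_states=[2]))
```

Power iteration is chosen here only because it needs no external library; solving the linear system for the stationary vector directly would work equally well.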

How to derive the transition matrix $A$?

Component-level Analysis (Cont.)

Hidden Markov Models (HMMs)
◦ Learn from the runtime data and estimate the transition probability matrix
◦ A stochastic process with a set of states $S = \{S_1, S_2, S_3, \dots, S_N\}$
◦ Transition matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the probability of transitioning from $S_i$ to $S_j$
◦ A set of observations $O = \{O_1, O_2, O_3, \dots, O_M\}$
◦ Observation matrix $E = \{e_{ik}\}$, where $e_{ik}$ is the probability of observing event $O_k$ in state $S_i$

The Baum-Welch algorithm is used to train and solve the HMM and obtain the converged transition matrix $A$
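A minimal sketch of Baum-Welch for a discrete HMM with a single observation sequence is shown below in plain Python. The function name `baum_welch`, the integer encoding of events, the random initialization, the fixed iteration count, and the tiny smoothing constant `eps` are all assumptions made for illustration; the deck states that the authors used Matlab and gives no implementation details.

```python
import math
import random


def baum_welch(obs, n_states, n_symbols, n_iter=30, seed=0, eps=1e-12):
    """Fit a discrete HMM to one sequence; returns (A, E, log_likelihoods)."""
    rng = random.Random(seed)

    def random_stochastic(rows, cols):
        raw = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]
        return [[v / sum(row) for v in row] for row in raw]

    A = random_stochastic(n_states, n_states)   # transition matrix
    E = random_stochastic(n_states, n_symbols)  # observation matrix
    pi = [1.0 / n_states] * n_states            # initial state distribution
    T, S = len(obs), range(n_states)
    log_likelihoods = []

    for _ in range(n_iter):
        # Forward pass, rescaled at every step to avoid numeric underflow.
        alpha, scale = [], []
        for t in range(T):
            if t == 0:
                row = [pi[i] * E[i][obs[0]] for i in S]
            else:
                row = [sum(alpha[t - 1][i] * A[i][j] for i in S) * E[j][obs[t]]
                       for j in S]
            c = sum(row)
            scale.append(c)
            alpha.append([v / c for v in row])
        log_likelihoods.append(sum(math.log(c) for c in scale))

        # Backward pass, reusing the forward scaling factors.
        beta = [[1.0] * n_states for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in S:
                beta[t][i] = sum(A[i][j] * E[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in S) / scale[t + 1]

        # Expected state occupancies (gamma) and expected transition counts (xi).
        gamma = [[alpha[t][i] * beta[t][i] for i in S] for t in range(T)]
        xi = [[sum(alpha[t][i] * A[i][j] * E[j][obs[t + 1]] * beta[t + 1][j]
                   / scale[t + 1] for t in range(T - 1)) for j in S] for i in S]

        # Re-estimation; eps only guards against division by zero.
        pi = gamma[0]
        A = [[(xi[i][j] + eps) / (sum(xi[i]) + n_states * eps) for j in S]
             for i in S]
        E = [[(sum(gamma[t][i] for t in range(T) if obs[t] == k) + eps)
              / (sum(gamma[t][i] for t in range(T)) + n_symbols * eps)
              for k in range(n_symbols)] for i in S]
    return A, E, log_likelihoods


observations = [0, 1, 2, 1, 0, 1, 2, 2, 1, 0] * 5  # hypothetical event codes
A_hat, E_hat, lls = baum_welch(observations, n_states=3, n_symbols=3)
```

Each row of the returned `A_hat` is a probability distribution over next states, so it can be passed to a steady-state solver to obtain the component reliability.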

Component-level Analysis (Cont.)

An example of estimating component reliability
◦ A robot controller behavior model
◦ States $S = \{\text{idle}, \text{estimating}, \text{planning}, \text{moving}, \text{failed}\}$
◦ Running the Baum-Welch algorithm on the observation sequence yields the transition matrix $A$
◦ Solving for the steady-state probability vector gives [0.1966, 0.2238, 0.3849, 0.1914, 0.0033]
◦ The Controller component reliability is therefore 1 − 0.0033 = 99.67%

Component-level Analysis (Cont.)

Estimate the near future by incorporating the context
◦ Define a set of contextual parameters $C = \{C_1, C_2, \dots, C_x\}$
◦ If $a_{kj}$ is a transition probability from state $S_k$ to state $S_j$ in matrix $A$ that is affected by changes in a specific contextual parameter $C_n$, then $a'_{kj} = \mu(a_{kj}, \Delta C_n)$, where $\mu$ is a context-specific function quantifying the impact of the contextual change on the transition probability
◦ The remaining transition probabilities in the row are adjusted proportionately such that $a'_{kj} + a_{kf} + \sum_m a'_{km} = 1$
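The row adjustment can be illustrated with a short Python sketch. Two things here are assumptions rather than content from the slides: the linear form of `mu_linear` (the deck only says that the function is context-specific), and the reading of $a_{kf}$ as a single entry, for example the transition to the failure state, that is left unchanged while the other entries are rescaled. The example row is hypothetical.

```python
def mu_linear(a_kj, delta_c, sensitivity=0.5):
    """Hypothetical mu: shift the probability linearly and clamp to [0, 1]."""
    return min(1.0, max(0.0, a_kj + sensitivity * delta_c))


def adjust_row(row, j, f, delta_c, mu=mu_linear):
    """Return a copy of `row` with entry j updated by mu, entry f held fixed,
    and every other entry rescaled proportionately so the row sums to 1."""
    new_kj = mu(row[j], delta_c)
    others = [m for m in range(len(row)) if m not in (j, f)]
    old_mass = sum(row[m] for m in others)
    new_mass = 1.0 - new_kj - row[f]
    if new_mass < 0 or old_mass <= 0:
        raise ValueError("cannot rescale the remaining probabilities")
    adjusted = list(row)
    adjusted[j] = new_kj
    for m in others:
        adjusted[m] = row[m] * new_mass / old_mass
    return adjusted


# Hypothetical 'moving' row over [idle, estimating, planning, moving, failed].
moving_row = [0.10, 0.20, 0.05, 0.64, 0.01]
print(adjust_row(moving_row, j=1, f=4, delta_c=0.2))
```

Rescaling by a common factor keeps the relative proportions of the unaffected transitions intact, which is one straightforward way to satisfy the row-sum constraint.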

Configuration-level Analysis

Markov-based system-level reliability estimation
◦ System reliability is estimated compositionally from the reliability of the individual components
◦ Map the components and the interactions between them into a DTMC, where a state is one or more components in concurrent execution
◦ System reliability is computed as

$$R = (-1)^{k+1} R_k \frac{|E|}{|I - M|}$$

where $M$ is a $k \times k$ matrix whose elements are

$$M(i, j) = \begin{cases} R_i P_{ij}, & s_i \text{ reaches } s_j \text{ and } i \ne k \\ 0, & \text{otherwise} \end{cases}$$

where $R_i$ is the reliability of state $s_i$, $P_{ij}$ is the transition probability from $s_i$ to $s_j$, and $|E|$ is the determinant of the matrix that remains after excluding the last row and the first column of $(I - M)$
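The determinant ratio in the system reliability formula is the (1, k) entry of the inverse of (I − M), so the formula can be evaluated by solving one linear system. The sketch below does this in plain Python; the function names, the Gaussian-elimination helper, and the two-state example are assumptions for illustration, and it presumes that state 1 is the start state and state k the final state.

```python
def solve_linear(mat, rhs):
    """Gauss-Jordan elimination with partial pivoting; solves mat * x = rhs."""
    n = len(mat)
    aug = [list(mat[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-15:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def system_reliability(R, P):
    """R[i]: reliability of state i; P[i][j]: transition probability i -> j.
    The first state is assumed to be the start state, the last the final one."""
    k = len(R)
    # M(i, j) = R_i * P_ij, except for the final state's row, which is zero.
    M = [[R[i] * P[i][j] if i != k - 1 else 0.0 for j in range(k)]
         for i in range(k)]
    i_minus_m = [[(1.0 if i == j else 0.0) - M[i][j] for j in range(k)]
                 for i in range(k)]
    unit_k = [0.0] * (k - 1) + [1.0]
    x = solve_linear(i_minus_m, unit_k)  # x[0] is entry (1, k) of (I - M)^-1
    return x[0] * R[k - 1]


# Hypothetical two-state chain: state 1 always hands over to state 2.
print(system_reliability([0.9, 0.8], [[0.0, 1.0], [0.0, 0.0]]))
```

For a purely sequential chain the result reduces to the product of the state reliabilities, which gives a quick sanity check on any implementation.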

Configuration-level Analysis (Cont.)

An example of estimating system reliability
◦ Suppose the initial component reliabilities of the Controller and the Navigator are $C = 0.9967$ and $N = 0.9751$, and assume the other components are 100% reliable
◦ From the observed data, we obtain the transition probability for each state and therefore $M$
◦ Solving the model yields a system reliability of 93.85%

Configuration-level Analysis (Cont.)

Impact of architectural style
◦ E.g., replicating components to improve system reliability

Configuration-level Analysis (Cont.)

Impact of deployment architecture
◦ E.g., reallocating components to different processes to improve system reliability

Configuration Selection

Configuration selection as an optimization problem
◦ The optimal configuration in RESIST is defined as one that satisfies the system's reliability requirement while improving the other quality attributes of concern
◦ In other words, given the decision variables
    $p_i \in \mathbb{Z}^+$, the number of replicas for component $i$
    $x_{ij} \in \{0, 1\}$, indicating whether component $i$ is placed on process $j$
  the objective is to find an architectural configuration $C^*$ such that

$$C^* = \arg\max_{C} \sum_{\forall q \,\in\, \text{Quality Objectives}} U_q(C)$$

$$\text{s.t.} \quad \forall i \in \{1, \dots, t\},\; p_i \le w_i,\; w_i \in \mathbb{Z}^+$$

$$\forall i \in \{1, \dots, t\},\; \sum_{j=1}^{h} x_{ij} = 1$$

$$R(C) \ge \delta,\quad \delta \in \mathbb{R},\; 0 < \delta \le 1$$

where $U_q$ is a utility function indicating the preference for quality attribute $q$, and $R(C)$ is the expected reliability of a given architecture $C$

Configuration Selection (Cont.)

Configuration reliability R(C)
◦ Assume a component may either be replicated or share a process with other components. This is expressed with a binary variable $q_i$, which is 1 if the $i$-th component shares a process and 0 otherwise:

$$q_i = 1 - \sum_{j=1}^{h} x_{ij} \prod_{k \ne i}^{t} (1 - x_{kj})$$

◦ Thus, the effective reliability of component $i$ is

$$r_i^{\text{eff}} = q_i\, r_i^{\text{share}} + (1 - q_i)\, r_i^{\text{rep}}$$

where

$$r_i^{\text{share}} = \sum_{j=1}^{h} r_i x_{ij} \prod_{k \ne i}^{t} \left[ r_k x_{kj} + (1 - x_{kj}) \right]$$

$$r_i^{\text{rep}} = 1 - (1 - r_i)^{1 + p_i}$$

◦ Finally, the system reliability can be computed as specified in the configuration-level analysis
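The effective-reliability formulas can be checked numerically with a small Python sketch. The function names and the three-component deployment are hypothetical; the code assumes a 0/1 placement matrix `x[i][j]` (component i on process j), base reliabilities `r[i]`, and replica counts `p[i]`, and it reads the co-location factor as r_k when component k sits on process j and 1 otherwise.

```python
from math import prod


def shares_process(x, i):
    """q_i: 1 if component i is co-located with another component, else 0."""
    alone = sum(
        x[i][j] * prod(1 - x[k][j] for k in range(len(x)) if k != i)
        for j in range(len(x[i]))
    )
    return 1 - alone


def effective_reliability(r, x, p):
    """Per-component effective reliability for placement x and replicas p."""
    t, h = len(r), len(x[0])
    result = []
    for i in range(t):
        q = shares_process(x, i)
        # Shared process: component i fails if any co-located component fails.
        r_share = sum(
            r[i] * x[i][j]
            * prod(r[k] * x[k][j] + (1 - x[k][j]) for k in range(t) if k != i)
            for j in range(h)
        )
        # Replicated: at least one of the 1 + p_i copies must survive.
        r_rep = 1 - (1 - r[i]) ** (1 + p[i])
        result.append(q * r_share + (1 - q) * r_rep)
    return result


# Hypothetical deployment: components 0 and 1 share process 0;
# component 2 runs alone on process 1 with one extra replica.
r = [0.9, 0.8, 0.7]
x = [[1, 0], [1, 0], [0, 1]]
p = [0, 0, 1]
print(effective_reliability(r, x, p))
```

In this example the two co-located components each drop to 0.9 × 0.8 = 0.72, while the replicated component rises from 0.7 to 1 − 0.3² = 0.91, which shows the trade-off the selector has to weigh.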

Configuration Selection (Cont.)

Time-complexity analysis
◦ Suppose we have
    P = number of processes
    C = number of components
    N = maximum number of replicas
◦ This implies that there are
    $O(P^C)$ ways of allocating components to processes
    $O(N^C)$ ways of replicating components
◦ Therefore, the total number of possible configurations is $O((NP)^C)$, which grows exponentially with the number of components (an NP problem)

However, the solution space may be significantly pruned by imposing architectural constraints
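The size of the search space can be confirmed by brute-force enumeration for small inputs. In this Python sketch the function name is hypothetical, and it assumes each component runs between 1 and N copies, so there are N replication choices per component.

```python
from itertools import product


def count_configurations(P, C, N):
    """Enumerate allocations (one process per component) and replication
    choices (1..N copies per component), then count the combinations."""
    allocations = sum(1 for _ in product(range(P), repeat=C))
    replications = sum(1 for _ in product(range(1, N + 1), repeat=C))
    return allocations * replications


# Small hypothetical instance: 2 processes, 3 components, up to 2 copies each.
print(count_configurations(P=2, C=3, N=2), (2 * 2) ** 3)
```

Adding a single component multiplies the count by N × P, which is why exhaustive search quickly becomes impractical and constraint-based pruning matters.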

Evaluation

Implementation
◦ Mobile emergency response system prototype
◦ XTEAM is used to control the system's operational profile
◦ Prism-XM is used to gather the runtime data
◦ Matlab is used to generate and solve the HMM

Evaluation Criteria
◦ Validity of reliability predictions
◦ Effectiveness of proactive reconfiguration
◦ Performance overhead

Evaluation (Cont.)

Validity of Reliability Prediction
◦ Bump Probability is used as the contextual parameter; it affects the transition probability from the moving state to the estimating state

Evaluation (Cont.)

Proactive Reconfiguration

Evaluation (Cont.)

Overhead of Component Reliability Analysis

Summary

RESIST is a framework that maintains the reliability of a situated software system through proactive reconfiguration of the software architecture

Three major components
◦ Component reliability analysis
◦ Configuration reliability analysis
◦ Configuration selector

Three key contributions
◦ Incorporates multiple sources of information, particularly contextual information
◦ Automatically finds the optimal architectural configuration
◦ Proactively adapts the system before its reliability degrades