
From Dependability to Resilience

Jean-Claude Laprie
LAAS-CNRS — Université de Toulouse — 7, Avenue Colonel Roche, 31077 Toulouse, France

<[email protected]>

Definition of resilience

Resilience (from the Latin resilire, to rebound) is literally the act or action of springing back. As a property, two strands can historically be identified: a) in social psychology [Claudel 1936], where it is about elasticity, spirit, resource and good mood, and b) in materials science, where it is about robustness and elasticity.

The notion of resilience has since been elaborated:

• in child psychology and psychiatry [Engle et al. 1996], referring to living and developing successfully when facing adversity;

• in ecology [Holling 1973], referring to moving from one stability domain to another under the influence of disturbances;

• in business [Hamel & Välikangas 2003], referring to the capacity to reinvent a business model before circumstances force it;

• in industrial safety [Hollnagel et al. 2006], referring to anticipating risk changes before damage occurs.

A common point to the above senses of the notion of resilience is the ability to successfully accommodate unforeseen environmental perturbations or disturbances.

A careful examination of [Holling 1973] makes it possible to draw interesting parallels between ecological systems and computing systems, due to:

a) the emphasis on the notion of persistence of a property: resilience is said to “determine the persistence of relationships within a system and is a measure of the ability of these systems to absorb changes of state variables, driving variables, and parameters, and still persist”;

b) the dissociation between resilience and stability: it is noted that “a system can be very resilient and still fluctuate greatly, i.e., have low stability” and that “low stability seems to introduce high resilience”;

c) the mention that diversity has a significant influence on both stability (decreasing it) and resilience (increasing it).

The adjective resilient has been in use for decades in the field of dependable computing systems, e.g. [Alsberg & Day 1976], and is increasingly in use, though essentially as a synonym of fault-tolerant, thus generally ignoring the unexpected aspect of the phenomena the systems may have to face. A noteworthy exception is the preface of [Anderson 1985], which says:

“The two key attributes here are dependability and robustness. […] A computing system can be said to be robust if it retains its ability to deliver service in conditions which are beyond its normal domain of operation.”

Fault-tolerant computing systems are known for exhibiting some robustness with respect to fault and error handling, in the above sense, i.e., for situations exceeding their specification. Examples are the tolerance of a) elusive software faults thanks to loosely-coupled architectures [Gray 1986], or of b) errors that escaped detection and thus did not trigger recovery [Kanoun et al. 1991]; a minimal sketch of such retry-based tolerance follows the resilience definitions below. This should of course not lead one to forget that, conversely, total coverage with respect to the specified faults is hardly achievable.

A total change of scale is needed when moving to the future large, networked, evolving systems constituting complex information infrastructures — perhaps involving everything from super-computers and huge server "farms" to myriads of small mobile computers and tiny embedded devices. Such systems are in fact the dawning of ubiquitous systems, and we will use this term as a shorthand for portraying our target systems. With such ubiquitous systems, what is at stake is to maintain dependability, i.e., the ability to deliver service that can justifiably be trusted [Avizienis et al. 2004], in spite of continuous changes. Our definition of resilience is then:

The persistence of service delivery that can justifiably be trusted, when facing changes.

The definition given above builds on the initial definition of dependability, which emphasizes justifiably trusted service. In a similar spirit, the alternate definition of dependability, which emphasizes the avoidance of unacceptably frequent or severe failures, could be used, leading to an alternate definition of resilience:

The persistence of the avoidance of failures that are unacceptably frequent or severe, when facing changes.

From what precedes, it clearly appears that a shorthand definition of resilience is:

The persistence of dependability when facing changes.
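As announced earlier, here is a minimal sketch, in Python, of the retry-based tolerance of elusive software faults: since the activation of such faults depends on transient conditions, re-executing the failed operation in a slightly different context often succeeds, whereas solid faults keep failing. The names TransientFault and tolerate_elusive_fault, the retry count, and the backoff policy are illustrative assumptions, not taken from the paper or from [Gray 1986].

```python
import time

class TransientFault(Exception):
    """Stands in for an elusive ('soft') software fault whose activation
    depends on transient conditions such as timing or interleaving."""

def tolerate_elusive_fault(operation, retries=3, backoff_s=0.1):
    """Retry an operation whose failures are assumed to be elusive:
    re-executing it later, in a slightly different context, often
    avoids the fault's activation conditions."""
    last_error = None
    for attempt in range(retries):
        try:
            return operation()
        except TransientFault as exc:
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))  # alter the execution context
    raise last_error  # a solid fault: retrying does not help, so escalate
```

Loosely-coupled architectures exploit the same observation at a larger grain: a failed request can be re-executed on a freshly initialized or distinct component, where the transient activation conditions are unlikely to recur.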

The changes can be classified according to three viewpoints, or dimensions:


• Their nature, which can be functional, environmental, or technological, where the latter can concern either or both hardware and software.

• Their prospect, which can be:

- foreseen, as in new versioning,

- foreseeable, as in the advent of new hardware platforms,

- unforeseen, as in drastic changes in service requests or new types of threats.

• Their timing, which can be:

- short term, e.g., seconds to hours, as in dynamically changing systems (spontaneous, or 'ad-hoc', networks of mobile nodes and sensors, etc.),

- medium term, e.g., hours to months, as in new versioning or reconfigurations,

- long term, e.g., months to years, as in reorganizations resulting from the merging of systems in company acquisitions, or from the coupling of systems in military coalitions.

It has to be emphasized, in the context of dependability, that the changes can concern, or induce changes in, the threats the system is facing. The threat changes can have their source in the changes to the system or its environment, taken either a) in isolation, such as, for technological changes, the ever-increasing proportion of transient hardware faults that goes along with the progress of integration, or b) in combination, such as the ever-evolving and growing problem of attacks both by amateur hackers and by professional criminals, which may result from environmental and technological changes. Finally, the changes can themselves turn into threats, as in the case of mismatches between the modifications that implement the changes and the former status of the system. Figure 1 summarizes schematically the classes of changes, and a small sketch modeling this classification follows the figure.

[Figure 1 - Change classification: changes are characterized by their nature (functional, environmental, or technological), their prospect (foreseen, foreseeable, or unforeseen), and their timing (short term, medium term, or long term), and can induce threat changes.]
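The classification of Figure 1 can be restated compactly as a data structure. This is a minimal sketch in Python; the type and field names are illustrative assumptions rather than anything from the paper:

```python
from dataclasses import dataclass
from enum import Enum

class Nature(Enum):
    FUNCTIONAL = "functional"
    ENVIRONMENTAL = "environmental"
    TECHNOLOGICAL = "technological"  # hardware, software, or both

class Prospect(Enum):
    FORESEEN = "foreseen"        # e.g., new versioning
    FORESEEABLE = "foreseeable"  # e.g., advent of new hardware platforms
    UNFORESEEN = "unforeseen"    # e.g., drastic changes in service requests

class Timing(Enum):
    SHORT_TERM = "seconds to hours"   # e.g., ad-hoc networks of mobile nodes
    MEDIUM_TERM = "hours to months"   # e.g., new versioning, reconfigurations
    LONG_TERM = "months to years"     # e.g., mergers, military coalitions

@dataclass
class Change:
    """A change characterized along the three dimensions of Figure 1."""
    nature: Nature
    prospect: Prospect
    timing: Timing
    induces_threat_change: bool = False  # changes can alter the threats too

# Example: the growing proportion of transient hardware faults that goes
# along with the progress of integration (technological, foreseeable).
example = Change(Nature.TECHNOLOGICAL, Prospect.FORESEEABLE,
                 Timing.LONG_TERM, induces_threat_change=True)
```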

Technologies for resilience

As ubiquitous systems are subject to continuous changes or evolutions, a central property they should exhibit, via appropriate technology, is evolvability, i.e., the ability to successfully accommodate changes. Within evolvability, an important topic is adaptivity, i.e., the capability of evolving while executing.

As our definition of resilience retains the notion of justified confidence, assessability, in both senses of verification and evaluation, comes immediately second. Classically, verification and evaluation are performed off-line, pre-deployment. Such an approach obviously falls short in the case of evolving systems, for which assessment has to be performed at run-time, during operation; a minimal sketch of this idea is given just before Figure 2.

Computing systems have already pervaded all activities of our life, and this will be even more true with ubiquitous systems, hence the importance of usability.

Ubiquitous systems being highly complex systems, heterogeneity and diversity are naturally present. Diversity can, and should, be taken advantage of in order to prevent vulnerabilities from becoming single points of failure (see the sketch after Figure 2).

Those four technologies for resilience are of course related to the means for dependability, as shown in Figure 2.
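Here is the minimal sketch of run-time assessment announced above, read through the alternate definition of resilience (avoidance of unacceptably frequent failures): a monitor that evaluates, during operation, whether observed failures stay acceptably rare over a sliding window. The class name, window length, and threshold are illustrative assumptions:

```python
from collections import deque
import time

class RuntimeDependabilityMonitor:
    """Evaluates a dependability measure during operation rather than
    only pre-deployment: here, failure frequency over a sliding window."""

    def __init__(self, window_s=3600.0, max_failures_per_window=2):
        self.window_s = window_s
        self.max_failures = max_failures_per_window
        self.failures = deque()  # timestamps of observed service failures

    def record_failure(self, now=None):
        self.failures.append(time.time() if now is None else now)

    def acceptable(self, now=None):
        """True while failures stay below the 'unacceptably frequent'
        threshold over the current window."""
        now = time.time() if now is None else now
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()  # discard failures outside the window
        return len(self.failures) <= self.max_failures
```

A False result of acceptable() could then trigger an adaptation, tying assessability back to evolvability.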

[Figure 2 - Relationship between the technologies for resilience (evolvability, assessability, usability, diversity) and the means for dependability (fault prevention, fault tolerance, fault removal, fault forecasting).]
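Finally, here is the sketch of diversity announced above: majority voting over diversified variants of the same computation, so that a design fault or vulnerability in any single variant is outvoted rather than becoming a single point of failure. The function and variant names, and the hypothetical fault in variant_c, are illustrative assumptions:

```python
from collections import Counter

def diverse_majority(replies):
    """Vote over results produced by diversified implementations of the
    same service; a wrong result from a single variant is outvoted."""
    value, votes = Counter(replies).most_common(1)[0]
    if votes > len(replies) // 2:
        return value
    raise RuntimeError("no majority: the diversified variants disagree")

# Three independently developed variants of the same computation; the
# (hypothetical) design fault in variant_c is masked by the vote.
variant_a = lambda x: x * x
variant_b = lambda x: x ** 2
variant_c = lambda x: x + x  # faulty variant

print(diverse_majority([v(9) for v in (variant_a, variant_b, variant_c)]))  # prints 81
```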

Acknowledgments

The work reported here has been performed in the framework of the European Network of Excellence ReSIST (Resilience for Survivability in Information Society Technologies). The work has benefited from discussions with several ReSIST members. Comments by Karama Kanoun and Brian Randell on former versions of the text are especially acknowledged.

References

[Alsberg & Day 1976] P.A. Alsberg, J.D. Day, "A principle for resilient sharing of distributed resources", Proc. 2nd Int. Conf. on Software Engineering, San Francisco, Oct. 1976, pp. 562-570.

[Anderson 1985] T. Anderson (Ed.), Resilient Computing Systems, Collins, 1985.

[Avizienis et al. 2004] A. Avizienis, J.-C. Laprie, B. Randell, C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing", IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, Jan.-March 2004, pp. 11-33.

[Claudel 1936] P. Claudel, "The American elasticity", in Works in Prose, La Pléiade, Gallimard, Paris, 1965, pp. 1204-1208; in French.

[Engle et al. 1996] P.L. Engle, S. Castle, P. Menon, "Child development: vulnerability and resilience", Social Science and Medicine, vol. 43, no. 5, 1996, pp. 621-635.

[Gray 1986] J.N. Gray, "Why do computers stop and what can be done about it?", Proc. 5th Symp. on Reliability in Distributed Software and Database Systems, Los Angeles, Jan. 1986, pp. 3-12.

[Hamel & Välikangas 2003] G. Hamel, L. Välikangas, "The quest for resilience", Harvard Business Review, Sept. 2003.

[Holling 1973] C.S. Holling, "Resilience and stability of ecological systems", Annual Review of Ecology and Systematics, vol. 4, 1973, pp. 1-23.

[Hollnagel et al. 2006] E. Hollnagel, D. Woods, N. Leveson (Eds.), Resilience Engineering – Concepts and Precepts, Ashgate, 2006.

[Kanoun et al. 1991] K. Kanoun, J. Arlat, L. Burrill, Y. Crouzet, S. Graf, E. Martins, A. MacInnes, D. Powell, J.L. Richier, J. Voiron, "Validation", ch. 15 of Delta-4: A Generic Architecture for Dependable Distributed Computing, D. Powell (Ed.), Research Reports ESPRIT, Springer-Verlag, 1991, pp. 395-406.