rag – resilience analysis gridresilience analysis grid 30/12/08 rag – resilience analysis grid...

20
Resilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction Resilience is today recognised as an important quality of an organisation or a system 1 and describes the systems ability to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations even after a major mishap or in the presence of continuous stress. Resilience Engineering does not see failures as a breakdown or malfunctioning of normal system functions, but rather as the converse of the adjustments required to cope with the unpredictability of the real world. Resilience Engineering, as a practical discipline, looks for ways to enhance the ability of systems to succeed under varying conditions. This means, more specifically, the ability to respond effectively to disruptions or ongoing production and economic pressures, to monitor threats and revise risk models, to anticipate future threats, disruptions and other destabilizing conditions, and to learn from past events, to understand correctly both what happened and why. Safety as a Quality A system is usually considered safe if the number of adverse outcomes is acceptably small. 2 Adverse outcomes are typically accidents and incidents, but may also include work time injury, work related illnesses, etc. Adverse outcomes are counted per safety unit, which can be characteristic operations or specific durations (cf., Amalberti, 2002). The level of safety is therefore measured by the number of such outcomes per safety unit, and the common interpretation is that higher safety corresponds to a smaller number of outcomes. One example of that is the following definition: Safety is the state in which the risk of harm to persons or of property damage is reduced to, and maintained at or below, an acceptable level through a continuing process of hazard identification and risk management. (ICAO, 2006). Resilience engineering takes a broader look and defines safety as the ability to succeed under varying conditions. This clearly includes the traditional definition of safety, since it is a consequence of resilience that there will be fewer adverse outcomes. But resilience also has the system’s ability to function broadly in focus. Resilience engineering is about the operations necessary for the system’s continued existence and growth, hence address core 1 In this technical note, the terms ‘organisation’ and ‘system’ will be used synonymously, even though they are not synonyms strictly speaking. 2 It is significant that safety is defined in terms of outcomes rather than in terms of, e.g., events. This was noted by Heinrich (1959) who made a clear distinction between the accident as an event and the injury as the outcome. E. Hollnagel Page 1

Upload: others

Post on 10-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

RAG – Resilience Analysis GridTechnical Document prepared by the Industrial Safety Chair,

January 2009

IntroductionResilience is today recognised as an important quality of an organisation or a system1 anddescribes the systems ability to adjust its functioning prior to, during, or following changesand disturbances, so that it can sustain required operations even after a major mishap or in thepresence of continuous stress. Resilience Engineering does not see failures as a breakdown ormalfunctioning of normal system functions, but rather as the converse of the adjustmentsrequired to cope with the unpredictability of the real world. Resilience Engineering, as apractical discipline, looks for ways to enhance the ability of systems to succeed under varyingconditions. This means, more specifically, the ability to respond effectively to disruptions orongoing production and economic pressures, to monitor threats and revise risk models, toanticipate future threats, disruptions and other destabilizing conditions, and to learn from pastevents, to understand correctly both what happened and why.

Safety as a Quality

A system is usually considered safe if the number of adverse outcomes is acceptably small.2

Adverse outcomes are typically accidents and incidents, but may also include work timeinjury, work related illnesses, etc. Adverse outcomes are counted per safety unit, which can becharacteristic operations or specific durations (cf., Amalberti, 2002). The level of safety istherefore measured by the number of such outcomes per safety unit, and the commoninterpretation is that higher safety corresponds to a smaller number of outcomes. One exampleof that is the following definition:

Safety is the state in which the risk of harm to persons or of property damage is reducedto, and maintained at or below, an acceptable level through a continuing process ofhazard identification and risk management. (ICAO, 2006).

Resilience engineering takes a broader look and defines safety as the ability to succeed undervarying conditions. This clearly includes the traditional definition of safety, since it is aconsequence of resilience that there will be fewer adverse outcomes. But resilience also hasthe system’s ability to function broadly in focus. Resilience engineering is about theoperations necessary for the system’s continued existence and growth, hence address core

1 In this technical note, the terms ‘organisation’ and ‘system’ will be used synonymously, even though they arenot synonyms strictly speaking.

2 It is significant that safety is defined in terms of outcomes rather than in terms of, e.g., events. This was notedby Heinrich (1959) who made a clear distinction between the accident as an event and the injury as theoutcome.

E. Hollnagel Page 1

Page 2: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

business (productivity, quality, and effectiveness) as well as safety. This has consequencesboth for how safety is measured and for how safety is managed.

The Four Cornerstones of Resilience

The key term in the definition of resilience is the ability of a system to adjust how the itfunctions. This makes clear that resilience is not just the ability to continue functioning in thepresence of stress and disturbances. While the ability of a system to preserve and sustain itsprimary functions is important, this can be achieved by other and more traditional means.Continued functioning can for instance be achieved by isolating the system from theenvironment, or by making it impervious to exogenous disturbances. An example of that is thedefence-in-depth principle, which means that there are multiple layers of barriers between thesystem and the environment in which it exists.

Adjustments can in principle take place either after something has happened (be reactive) orbefore something happens (be proactive). Reactive adjustments are by far the most common.For instance, if there is a major accident in a community, such as a large fire or an explosion,local hospitals will change their state of functioning and prepare for a rush of people that mayhave been hurt. Responding when something has happened may, however, be insufficient toguarantee the system’s safety and survivability. One reason is that a system can only be readyto respond to a limited set of events or conditions, either because it only recognises a certainset of symptoms or because it only has the resources needed for some kinds of events but notfor others – and usually only for a limited duration. Vivid examples of that can be found ineveryday events, the most conspicuous case in recent years being the unfortunate lack ofresponse by the Federal Emergence Management Agency (FEMA) to the Hurricane Katrina in2005 (e.g., Comfort & Haase, 2006). In the world of business, the failure of Airbus companyto recognise and effectively respond to the problems with the production of the A380 in theJune 2006, and the later failure of the Boeing company to do the same with the production ofthe 787 in September 2007, suggest that limited readiness is not an unusual phenomenon atall.The ability to adjust after something has happened relies on the experiences from past events,not only to establish a specific readiness but also to make decisions about structural orfunctional changes that may make the system better prepared for what can happen in thefuture. These changes are often directed at the causes, as determined by accidentinvestigations, although such causes and explanations always must be seen relative to theaccident models and the investigation methods that were used (Hollnagel, 2004; Hollnagel &Speziali, 2008).

Making adjustments prior to an event means that the system can change from a state ofnormal functioning to a state of heightened readiness before something happens. A state ofreadiness means that resources are allocated to match the needs of the expected event, thatspecial functions are activated, and that defences are increased. A trivial example is to battendown the hatches to prepare for stormy weather, either literally or metaphorically. Aneveryday example from the world of aviation is to secure the seat belts before start and

E. Hollnagel Page 2

Page 3: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

landing or during turbulence. In these cases, the criteria for changing from a normal state to astate of readiness are clear. In other cases it may be less obvious either because of a lack ofexperience or because the validity of indicators is questionable. An increased state of alertnessshould, of course, not last longer than necessary since it may consume resources thatotherwise could be used for normal performance.

The working definition of resilience can be made more detailed by pointing to fourcornerstones of resilience, each representing an essential capability, cf., Figure 1:

• Knowing what to do, i.e., how to respond to regular and irregular disruptions anddisturbances by adjusting normal functioning. This is the ability to address the actual.

• Knowing what to look for, i.e, how to monitor that which is or could become a threat inthe near term. The monitoring must cover both that which happens in the environmentand that which happens in the system itself, i.e., its own performance. This is the abilityto address the critical.

• Knowing what to expect, i.e., how to anticipate developments and threats further intothe future, such as potential disruptions, pressures, and their consequences. This is theaddress to address the potential.

• Knowing what has happened, i.e., how to learn from experience, in particular to learnthe right lessons from the right experience. This is the ability to address the factual.

Basic Requirements to Manage Something

In order to manage a system – indeed, in order to manage anything – three requirements mustbe met. First, it is necessary to know the current status or present position. Second, it isnecessary to have a clear idea about what the future status or position should be. And third, itis necessary to know by which means an effective change from the present position to thefuture position can be made.

E. Hollnagel Page 3

Figure 1: The four cornerstones of resilience

Page 4: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

The problem of management, which in practice is equivalent to the problem of control, can ingeneral be expressed as follows, where each of the three requirements corresponds to aspecific question:

• How X are we? (Alternatively: how much of X do we have?)

• How X should we be? (Alternatively: how much of X should we have?)

• How do we become more / less X, or how do we get more / less of X?

The quality X can be many things, as the following two simple examples will show. In thecase of safety, the three questions become:

• How safe is our operation or how safe is our department / company / primary process /etc.?

• How safe should our operation or system be, and when (by the end of this year / nextyear/ five years from now, and so on.)?

• Which means do we have that can make our operation or system safer, and how shouldwe apply them?

Or, to take a more topical example, the questions for financial exposure could look like this:

• How exposed are we to the volatility of the stock market?

• How exposed should we be to the volatility of the stock market?

• How do we become less exposed to the volatility of the stock market?

(Notice that in the case of safety, the desired change is to increase something, where as in thecase of financial exposure the desired change is to reduce something. The direction of changedepends on how the second requirement is formulated.)Process and safety management traditionally put much effort into meeting the firstrequirement, i.e., how something can be measured. But in order to be in control of somethingit is clearly necessary to know the direction in which change should be made, whether itshould be towards something rather than away from something, and to have the means (tools,techniques) effectively to make the change. The present note will for practical reasons focuson the first requirement, but it is planned to address the two other requirements at a later time.

E. Hollnagel Page 4

Page 5: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

The Measurement Problem The first requirement raises the problem of measurements, specifically of performancemeasurements or performance indicators. This involves a number of commonly recognisedissues, such as:

• Can the status be expressed by a single measurement or does it require a calculation,i.e., a combination of several measurements?

• Do the measurements represent lagging indicators or leading indicators, i.e., is themeasurement of a condition in the past or is it indicative of a position in the future?

• Are the measurements reliable and valid?

• Are the measurements well-defined?

• Are the measurements objective or subjective, e.g., can they be made automatically andby technological means or do they rely on the judgement or opinions of people?

A measurement is normally understood as a quantity of something, for instance a value or anumber, and is normally a scalar rather than a vector. But a number in itself is meaninglessunless it can be set in relation to something or some context. In other words, quantities ornumbers have to be interpreted. This can only be done by referring to a commonunderstanding or a set of conventions. Consider, for instance, the following example:

52 The number 52 taken by itself does not mean anything. Indeed, it could be a symbol as well as a number.

52 kilos The number has now got a unit, so at least we know what the number represents, namely a weight of something. But we do not know whether it is 52 kilos of CO2 (which is how much a common car produces when driven for 350 km), or 52 kilos of cheese.

Pierre weighs 52 kilos

The added information, i.e., the name of a person, makes the number even more meaningful. We might even begin to make assumptions about whether this weight indicates a normal condition or an unusual condition.

Pierre is 12 years old and weighs 52 kilos

Because of the additional information, the number finally makes some sense. Even without being a physician we can say that the weight is not normal, and that Pierre possibly is obese.

E. Hollnagel Page 5

Page 6: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

The final formulation does not only provide the necessary context to understand what thenumber means, but also invokes a frame of reference that can be used to plan what to do as aconsequence of the measurement.3

Measurements of Safety

As mentioned above, safety is usually defined as the absence of unwanted events. The level ofsafety is consequently measured by the number of such events per safety unit. As a simpleexample of that, consider the top five HSE4 indicators used by the oil industry.

1. (Number of) fatal accidents2. Total recordable injury frequency (TRIF)

3. Lost-time injury frequency (LTIF)4. Serious HSE incident frequency (SIF)

5. Accidental oil spill (number and volume)

The European Technology Platform on Industrial Safety uses the same approach. The ETPISdoes acknowledge that “Safety is ... a key factor for successful business and an inherentelement of business performance,” and then continues:

... (I)ndustrial safety performance will have progressively and measurably improved interms of reduction of reportable accidents at work, occupational diseases, environmentalincidents and accident-related production losses.

It is quite understandable that safety traditionally has focused on adverse outcomes, sincethese represent situations that any organisation would want to avoid. Adverse outcomes arealso phenomena that by their very nature attract attention both in terms of their direct effects(loss of life, property, and money) and in terms of their indirect effects (disruption offunctions and production, need of recovery operations, restoration, etc.).Seen as measurements, adverse outcomes have two further advantages. The first is that theyare countable, the second that they are (relatively) unambiguous.5 They are also helpfultowards meeting the second requirement, since it clearly is a step in the right direction toreduce the number of adverse outcomes, i.e., to reduce their number. But measurements ortallies of adverse outcomes are of little help to meet the third requirement, i.e., howeffectively to make a change in the desired direction.

3 This can, of course, involve not doing anything, if, e.g., the value was considered normal.

4 HSE = Health, Safety, security and Environment

5 Both advantages require that the outcomes are clearly defined in operational terms. In practice there may belarge cultural and national differences in such definitions.

E. Hollnagel Page 6

Page 7: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

One problem with measuring safety by the number of adverse outcomes or unwanted events(accidents, incidents, etc.) is that it works best in the beginning, when safety is low, but not sowell later, when safety is high. The reason is simply that if the number of adverse events hasbeen effectively reduced, there is little to measure, hence little to show for the efforts made.Consider, for instance, the Strategic Research Agenda of ETPIS (2005). The explicit objectiveis “to achieve a 25% reduction in accidents by 2020 and to have programmes in place by 2020to continue accident reduction at a rate of 5% per year or better.” Assuming that these goalscan be achieved, it does not require many calculations to realise that the proposedmeasurement at some time in the not too distant future will have reached an asymptote, hencewill become useless.

Difference between Safety and Resilience

The difference between safety and resilience makes a difference in how the two qualities canbe measured. Safety is normally measured by the number of adverse events, where someexamples have been provided above. This means that safety is measured by the product of aprocess, in this case the process of safety management. As any process, this will have twotypes of outcomes, the intended outcomes and the unintended outcomes. Whereas a normalfeedback-controlled process used measurements of the intended outcomes as a basis forregulation, the safety management uses measurements of the unintended outcomes, i.e., theaccidents, incidents, etc.6

Since resilience refers to the system’s performance, the measurements should be ofperformance (process) rather than outcomes (products). This has two advantages. First, that

6 From a regulation point of view this is probably not the most efficient way of controlling the process.

E. Hollnagel Page 7

Figure 2: Safety measured by adverse outcomes

Page 8: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

the measurements deal equally with successes and failures, or rather with the processesunderlying successes and failures. This supports the Resilience Engineering view of failuresas the flip side of successes, hence as being the outcomes of the same underlying processes.Second that the measurements are direct measurements of the process, and not indirectmeasurements of products. The traditional approach to safety requires an interpretation ofwhat is measured (adverse outcomes) in terms of the possible underlying process – and morespecifically the underlying process failure. Resilience makes that interpretation unnecessaryby letting the measurement be of the process directly.7

Measurements of Resilience

Since resilience is defined by the system’s ability to adjust its functioning, it follows that ameasure of resilience must be different from the traditional measures of safety. Sinceresilience refers to a quality rather than a quantity, something that the system does rather thansomething that the system has, it is not possible to point to any single or simple measurement.The solution is instead to consider the four capabilities that together define resilience, and onthat basis develop a Resilience Analysis Grid, i.e., four sets of questions where the answerscan be used to construct a resilience profile of a system or an organisation. The rest of thereport will present an outline of such a Resilience Analysis Grid.

7 From a methodological point of view it is, of course, only possible to measure a process by its products. Butit makes a difference whether the ‘products’ are indicators of system states, which requires that the system isactive, or whether they are of the outcomes as they exist independently of the process.

E. Hollnagel Page 8

Figure 3: Direct and indirect measures of safety

Page 9: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

The ability to respond

No system – be it an individual, a group, or an organisation – can sustain its functioning andcontinue to exist unless it is able to respond to what happens. The response must furthermorebe both timely and effective so that it can bring about the desired outcome before it is too late.As described above, a resilient system responds by adjusting its functioning so that it bettermatches the new conditions. Other responses are to mitigate the effects of an adverse event, toprevent a further deterioration or spreading of effects, to restore the state that existed beforethe event or to resume the functioning that existed before, to change to stand-by conditions,etc. In order to respond when something happens the system must be able to detect that somethinghas happened. Second, it must be able to identify the event and recognise or rate it as being soserious that a response is necessary. Third, the system must know how to respond and becapable of responding, in particular it must have or be able to command the requiredresources long enough for the response to have an effect.

The detection that something has happened is not entirely passive but depends on what thesystem looks for – on what its pre-defined categories of critical events or threats are. If thesystem looks for the wrong events or threats it may either fail to recognise some threats (amiss) or respond to situations where a response was not actually needed (a false alarm). Theformer will leave the system vulnerable to unexpected events. The latter may be harmful bothbecause the system may transition to a state that is not easily reversible, and because it wastesresources and reduces readiness. Some events may be so obvious that they cannot be missed, yet without any response beingready – or even without a clear idea of what should be done. (The subprime crisis of 2007 wasan example of that.) In such cases there may also be an urgency of the situation, i.e., animmediate pressure to act, which by its very nature limits the ability to consider what theproper response should be. Under such conditions the system may easily lose control byresponding in an opportunistic or scrambled rather than in a more orderly mode (Hollnagel,1998).

If an event or a threat is rated as serious, the response can either be to change to a state ofreadiness, or to take action in the concrete situation. In the first case, deciding that a statechange is needed depends on a number of factors, cultural, organisational, and situational. Thedilemma is nicely captured by the common definition of safety as the freedom fromunacceptable risks, which forces the question of how large a risk is considered acceptable –and by whom. A common solution is to rely on probability calculations, and accept all riskswhere the probability is lower than some numerically defined limit (e.g., Amalberti, 2006).This, however, does not solve the problem of how the limit is set.8 The second case is how to decide whether a response should be activated in a given situation.As long as the activation of the response depends on technology, including software, theproblem is in principle solvable. But in cases where the decision depends on humans, theproblem is more difficult. Deciding whether to do something, and when to do it, depends to a

8 It also requires that the risk can be accurately measured.

E. Hollnagel Page 9

Page 10: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

considerable extent on the competence of the people involved, and on the situation in whichthey find themselves (e.g., Dekker & Woods, 1999).

Finally, having the resources necessary for the chosen response is also essential. This is notonly a question of having prepared resources, which really only makes sense for regularthreats (cf., below), but also a question of whether the system is flexible enough to make thenecessary resources available when needed.

Analysis item (ability to respond) Score or evaluation

Event list: What are the events for which the system has a prepared response?

Background: How were these events selected (experience, expertise, risk assessment, etc.?

Relevance: When was the list created? How often is it revised? On which basis is it revised?

Threshold: When is a response activated? What is the triggering criterion or threshold? Is the criterion absolute or does it depend on internal / external factors?

Response list: How was the specific type of response decided? How is it ascertained that it is adequate? (Empirically, or based on analyses or models?)

Speed: How fast is full response capability available?

Duration: For how long can a 100% effective response be sustained?

Stop rule: What is the criterion for returning to a “normal” state?

Response capability: How many resources are allocated to the response readiness (people, materials)? How many are exclusive for the response potential?

Verification: How is the readiness to respond maintained? How isthe readiness to respond verified?

The ability to monitor (keeping an eye on critical developments)

A resilient system must be able flexibly to monitor what is going on, including its ownperformance. The ability to monitor enables the system to cope with that which could becomecritical in the near term. In order for the monitoring to be flexible, its basis must be assessedand revised from time to time. As argued above, it is in practice only possible for a system to be ready to respond to regularthreats, or even just to a subset of these. It is nevertheless a potential risk if the readiness torespond is limited to too small a number of events or conditions. The solution is to monitorfor things that may become critical, and use that to change from a state of normal operation to

E. Hollnagel Page 10

Page 11: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

a state of readiness when the conditions indicate that a crisis, disturbances, or failure isimminent. Such as two-step approach will be more cost effective. If a system can make itselfready when something is going to happen, rather than remain in a state of readiness more orless permanently, then resources may be freed for more productive purposes. The difficulty is,of course, to be able to decide that something may go wrong so early that there is sufficienttime to change to a state of readiness. It is also necessary that the identification of theimpending event is so reliable that preparations are not made in vain. There will of coursealways be situations that completely defy both preparations and monitoring – the dreadedunexampled events – but more can be done to reduce their number and frequency ofoccurrence than established safety practices allow.

Monitoring normally looks for certain conditions or relies on certain indicators. These are bydefinition called leading indicators, because they indicate what may happen before it happens.Everyday life is replete with examples, as the indicators for the weather tomorrow or for thecoming winter (or summer). In the case of the weather there are good leading indicatorsbecause we have an accurate understanding of the phenomenon, i.e., of how the (weather)system functions. In other cases, and particularly in safety related cases, we only have weakor incomplete descriptions of what goes on and therefore have no effective way of proposingor defining valid leading indicators. Because of this, most systems rely on lagging indicatorsinstead, such as accident statistics. While many lagging indicators have a reasonable facevalidity, they are only known with a delay that often may be quite considerable (e.g., annualstatistics). The dilemma of lagging indicators is that while the likelihood of success increasesthe smaller the lag is (because early interventions are more effective than late ones), thevalidity or certainty of the indicator increases the longer the lag (or sampling period) is.

Analysis item (ability to monitor) Score or evaluation

Indicator list: How have the indicators been defined? (By analysis, by tradition, by industry consensus, by the regulator, by international standards, etc.)

Relevance: How often is the list of indicators revised, and on whatbasis?

Indicator type: How many of the indicators are leading, and how many are lagging?

Validity: For leading indicators, how is their validity established?

Delay: For lagging indicators, how long is the lag?

Measurement type: What is the nature of the “measurements”? Qualitative or quantitative? (If quantitative, what kind of scaling is used?)

Measurement frequency: How often are the measurements made? (Continuously, regularly, every now and then?)

E. Hollnagel Page 11

Page 12: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

Analysis item (ability to monitor) Score or evaluation

Analysis: What is the delay between measurement and analysis/interpretation? How many of the measurements are directly meaningful and how many require analysis of some kind?

Stability: Are the effects measured transient or permanent?

Organisational support: Is there a regular inspection scheme or schedule? Is it properly resourced?

The ability to anticipate (looking for future threats and opportunities)

While looking for what may go wrong in the immediate future generally makes sense, it maybe less obvious that there is an advantage to look at the more distant future. The differencebetween monitoring and looking ahead is both that the time horizons are different (shortversus long), and also that it is done in different ways. In monitoring, a set of pre-defined cuesor indicators are checked to see if they change in a way that may demand a response. Inlooking for the potential, the goal is to identify possible future events, conditions, or statechanges – internal or external to the system – that may threaten the system’s ability tofunction. While monitoring tries to keep an eye on the regular threats, looking for thepotential tries to identify the most likely irregular threats.While risk assessment already does look for the potential, it is constrained because it relies onrepresentations of linear combinations of discrete events, such as event trees and fault trees.Established risk assessment methods are developed for tractable systems where the principlesof functioning are known, where descriptions do not contain too many details, wheredescriptions can be made relatively quickly, and where the system does not change while thedescription is being made (Hollnagel, 2008). For such systems it may be acceptable to lookfor the failure potential in simple combinations of discrete events or linear extrapolations ofthe past. Many present day systems of major interest for industrial safety are unfortunately notlike that. This means that the principles of functioning are only partly or incompletely known,that the description is elaborate and contains many details, that it takes a long time to make,and that the system therefore changes while the description is made. In consequence of thatthere will never be a complete description of the system and it is therefore ill advised to relyon established risk assessment methods.

Looking for the potential requires requisite imagination or the ability to imagine key aspectsof the future (Westrum, 1993). As described by Adamski & Westrum (2003), requisiteimagination is needed to know from which direction trouble is likely to arrive and to explorethose factors that can affect outcomes in future contexts. The relevance of doing that isunfortunately not always accepted, since it requires resources that could have been used for,e.g., production. Even if the possibility that something could go wrong is acknowledged, thinking about thepotential is fraught with difficulties. Many studies have, for instance, shown that humanthinking makes use of a number of simplifying heuristics such as representativeness, recency,

E. Hollnagel Page 12

Page 13: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

and anchoring (Tversky & Kahneman, 1974). While these may improve efficiency in normalworking conditions, they severely restrict the more open-minded thinking that is necessary tolook at the possible. Looking for the potential is also difficult because it requires a disciplinedcombination of individual or collective imagination. It can also be costly, both because itcannot be hurried but must take its time and because it deals with something that may happenso far into the future that benefits are rather uncertain. Relatively few systems thereforeallocate sufficient resources to look at the potential. However, a truly resilient system realisesthe need at least to do something.

Analysis item (ability to anticipate) Score or evaluation

Expertise: What kind of expertise is relied upon to look into the future? (In-house, outsourced?)

Frequency: How often are future threat and opportunities assessed?

Communication: How are the expectations about future events communicated or shared within the organisation?

Strategy: Does the organisation have a clearly formulated ‘model of the future’?

Model: Is the model explicit or implicit? Qualitative or quantitative?

Time horizon: How far ahead does the organisation plan? Is the time horizon different for business and safety?

Acceptability: Which risks are considered acceptable and which unacceptable? On which basis?

Aetiology: What is the assumed nature of future threats?

• Same as previous threats/accidents?

• Combination/extrapolation of known accidents / incidents?

• Completely novel threats?

Culture: Is risk awareness part of the organisational culture?

The ability to learn (finding and making use of the right experience)

A resilient system must be able to learn from experience. Although this is mentioned last, it isin many ways the basis for the ability to respond, to monitor, and to look ahead. To learn fromexperience sounds rather straightforward and few safety managers, administrators, orregulators will disagree with that. Yet if it is to be done in an efficient and systematic manner,it requires careful planning and ample resources. The effectiveness of learning depends onwhat the basis for the learning is, i.e., which events or experiences are taken into account; onhow the events are analysed and understood; and on when and how often the learning takesplace.

E. Hollnagel Page 13

Page 14: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

In learning from experience it is important to separate what is easy to learn from what ismeaningful to learn. Experience is often couched in terms of the number or frequency ofoccurrence of some event or other, usually ones that are negative (accidents, incidents, losstime, etc.). But counting is not the same as learning. In order for a measure to be useful, itmust be meaningful, hence refer to a principle, a model, or some kind of conceptual basis.While compiling extensive accident statistics may seem impressive it does not mean that thesystem actually learns anything. Knowing how many accidents have occurred says nothingabout why they have occurred, nor anything about the many situations when accidents did notoccur. And without knowing why accidents occur, as well as knowing why they do not occur,it is impossible to propose effective ways to improve safety.

Analysis item (ability to learn) Score or evaluation

Selection criteria: Which events are investigated and which are not? How is the selection made? Who makes the selection?

Learning basis: Does the organisation try to learn form successes as well as from failures?

Classification: How are events described? How are data collected and categories?

Formalisation: Are there any formal procedures for investigation and learning?

Training: Is there any formal training or organisational support for investigation and learning?

Learning style: Is learning a continuous or discrete (event-driven)activity?

Resources: How many resources are allocated to investigation and learning? Are they adequate?

Delay: What is the delay in reporting and learning? How are the outcomes communicated within and without the organisation?

Learning target: On which level does the learning take effect? (individual, collective, organisational)

Rating ResilienceThe four sets of items described above constitute the Resilience Analysis Grid (RAG). Inorder for the RAG to be useful as a tool, it is necessary that the answer to each item can berated or assigned a value. This can be done in a number of fashions. As a beginning, it isproposed that the rating is done according to the following scale:

• Excellent – the system/organisation meets and exceeds the criteria for the requiredcapability.

E. Hollnagel Page 14

Page 15: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

• Satisfactory – the system/organisation fully meets all reasonable criteria for the requiredcapability

• Acceptable – the system/organisation meets the nominal criteria for the requiredcapability.

• Unacceptable – the system/organisation does not meet the nominal criteria for therequired capability.

• Deficient – there is insufficient capability to provide the required capability.

• Missing – there is no capability to provide the required capability or information is notavailable.

Consider, for instance, the first item for ‘ability to respond’: ‘What are the events for whichthe system has a prepared response.’ The rating can be made using the following guideline:

• Excellent – the system/organisation can respond to all imaginable events.

• Satisfactory – the system/organisation can respond to all reasonable events.

• Acceptable – the system/organisation can respond to events that occur frequently, orevents that are defined by a regulator.

• Unacceptable – the system/organisation can only respond to some of the events thatoccur frequently.

• Deficient – the system/organisation can only respond to the most critical events.

• Missing – there are no prepared responses or information is not available.

Similar guidelines to evaluate or rate responses can be developed for the other items. Once therating has been done for the 10 items that characterise the ‘ability to respond,’ it is possible toshow the ratings graphically using the star-diagram shown in Figure 4. The advantage of thisstyle of graphical representation is that it uses of a regular polygon to show the result of theevaluation. Any irregularities in the shape of the polygon will be easy to detect, and provide aclear signature of how well the system/organisation rates in regard to the ability to respond.

E. Hollnagel Page 15

Page 16: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

Similar star-diagrams can be developed for the other abilities. In order to provide an overallimpression of the resilience of an organisation, it is necessary to specify a way to collapseeach star diagram, or rather the ratings of the individual items, to a single rating. When that isdone, it is possible to show a star diagram for the four components of resilience, i.e., theabilities to respond, monitor, anticipate, and learn. Figure 5 shows what a RAG star diagramwould look like for an organisation, where the combined ratings for all of the fourcomponents were ‘acceptable.’

E. Hollnagel Page 16

Figure 4: Star diagram for 'ability to respond'

Page 17: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

The next two figures illustrate how other ratings would appear. Figure 6 shows the RAG for aorganisation that does well in terms of the ability to respond and monitor, but which fails interms of the ability to anticipate and learn. While such an organisation may be safe in theshort run, it is not resilient.

Figure 7 shows what a star diagram would like for a resilient organisation. The RAG revealsthat the organisation does very well in terms of the ability to respond and monitor. But inaddition it puts efforts into looking at the potential (‘anticipate’) and has an acceptable abilityto learn. The diagram also suggests how the resilience can be improved.

E. Hollnagel Page 17

Figure 5: RAG star diagram (showing acceptable ratings)

Page 18: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

E. Hollnagel Page 18

Figure 6: RAG star diagram (show lack of resilience)

Figure 7: RAG star diagram for resilient organisation

Page 19: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

SummaryThe Resilience Analysis Grid presented here is not proposed as a final tool that can be useddirectly. It is rather intended as a basis from which a more specific grid – or set of questions –can be developed. The questions must clearly be relevant for the organisation where they areintended to be used, and may therefore require clarification and reformulation.

The note has outline the principles for how the evaluations can be rated, and the star digramsimilarly suggest what constitutes an acceptable score. The star diagram is not in itself ameasure of resilience, but is a compact representation of how the various items have beenrated. It is also a process measure rather than a product measure, i.e., it shows the current stateof things – the current level of resilience and of how well the organisation does on each of thefour main capabilities.

The Resilience Analysis Grid can be used useful to determine the current state of resilience ofan organisation, and also to define the future position or objective. This does not follow fromthe questions themselves, but from the way in which they are evaluated. The ResilienceAnalysis Grid, is also useful to meet the third requirement, because the questions are derivedfrom an underlying theory of what resilience is.

Resilience engineering does not prescribe a certain balance or proportion among the fourqualities. But it does make clear that it is necessary for an organisation to address each ofthese qualities to some extent, in order to be resilient. This can be illustrated by the shape ofthe polygon in the star diagrams, provided appropriate rating rules and weights have been

E. Hollnagel Page 19

Figure 8: Implementation questions for resilience engineering

Page 20: RAG – Resilience Analysis GridResilience Analysis Grid 30/12/08 RAG – Resilience Analysis Grid Technical Document prepared by the Industrial Safety Chair, January 2009 Introduction

Resilience Analysis Grid 30/12/08

developed. All organisations traditionally put some effort into the ability to respond to theactual. Many also put some effort into the ability to learn from the factual, although it often isin a very stereotypical manner. Fewer organisations make a sustained effort to monitor thecritical, particularly if there has been a long period of stability. And very few organisations putany serious effort into the ability to anticipate the potential.

ReferencesAdamski, A. & Westrum, R. (2003). Requisite imagination. The fine art of anticipating whatmight go wrong. In E. Hollnagel (Ed.), Handbook of cognitive task design (pp. 193-220).Mahwah, NJ: Lawrence Erlbaum Associates.

Amalberti, R. (2002), Revisiting safety and human factors paradigms to meet the safetychallenges of ultra complex and safe systems. In B. Wilpert & B. Fahlbruch (Eds.), Systemsafety: Challenges and pitfalls of intervention. Pergamon.Comfort, L. K. & Haase, T. W. (2006). Communication, coherence and collective action: theimpact of Hurricane Katrina on communications infrastructure. Public Works Management &Policy, 11(1), 6-16.

Dekker, S. W. A. & Woods, D. D. (1999). To intervene or not to intervene: The dilemma ofmanagement by exception. Cognition, Technology & Work, 1(2), 86-96.

European Technology Platform on Industrial Safety (ETPIS) (2005). Safety for SustainableEuropean Industry Growth: Strategic Research Agenda. www.industrialsafety-tp.org

Hollnagel, E. (1998). Cognitive reliability and error analysis method. London, UK: Elsevier.

Hollnagel, E. (2004). Barriers and accident prevention. Aldershot, UK: Ashgate.

Hollnagel, E. (2008). From protection to resilience: Changing views on how to achieve safety.Proceedings of the 8th International Symposium of the Australian Aviation PsychologyAssociation, April 8-11, Sydney, Australia.

Hollnagel, E. & Speziali, J. (2008). Study on developments in accident investigation methods:A survey of the “state-of-the-art” (SKI 2008:50). Stockholm, Sweden: Swedish NuclearInspectorate.ICAO (International Civil Aviation Organisation), (2006). Safety management manual.Montreal, Canada: Document sales unit.

Tversky, A. & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases.Science, 185, 1124-1131.

Westrum, R. (1993). Cultures with requisite imagination. In J. A. Wise, V. D. Hopkin & P.Stager (Eds.), Verification ad validation of complex systems: Human factors issues. Berlin:Springer-Verlag.

E. Hollnagel Page 20